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A new general framework is presented for implementing complex a priori knowl- 
edge, having in mind especially situations where the number of available training 
data is small compared to the complexity of the learning task. A priori information 
is hereby decomposed into simple components represented by quadratic building 
blocks (quadratic concepts) which are then combined by conjunctions and disjunc- 
tions to built more complex, problem specific error functionals. While conjunction 
of quadratic concepts leads to classical quadratic regularization functionals, disjunc- 
tions, representing ambiguous priors, result in non-convex error functionals. These 
go beyond classical quadratic regularization approaches and correspond, in Bayesian 
interpretation, to non-gaussian processes. Numerical examples show that the re- 
sulting stationarity equations, despite being in general nonlinear, inhomogeneous 
(integro-) differential equations, are not necessarily difficult to solve. Appendix [a| 
relates the formalism of statistical mechanics to statistics and Appendix pi describes 
the framework of Bayesian decision theory. 
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1. Introduction 

In the setting of empirical learning training data are used to predict the 
outcome of future test situations. The principal problem of empirical learning 
can already be seen in a noise free toy example: Assume we have measured h(x\) 
= 2 (training data) and have to predict h{x2) (test data). Clearly, this is a 
hopeless task unless we know some relations between h{x\) and h(x2) If> 
however, we know a relation, for example h(x%) — h(xi) = 5, the task becomes 
solvable. Such relations do not have the form of standard training data and must 
be provided by what we will call a priori information. 

The simple example clarifies two points which are also valid for more complex 
scenarios: 1. learning is technically merely a reformulation of available (training 
and a priori) knowledge, and 2. the success of learning is essentially based on 
empirical control of the implied dependencies between test and training data. 

One may now distinguish two principal possibilities of implementing a priori 
information in learning systems: 1. The dependencies between test and training 
data can be implemented by restricting the space of possible functions h(x), for 
example by choosing a parameterized form for h(x). Examples include linear re- 
gression models or (finite) neural networks, hereby it is often difficult to interpret 



such implemented priors in terms of dependencies of function values h(x) [7C]. 
2. Dependencies between test and training data may also be directly expressed 
in terms of the functions values h(x) itself, like we have done in the above toy 
example. This is the approach used for regularization terms in empirical risk 
minimization and for stochastic prior processes in Bayesian statistics. Here, for 
example, a smoothness term like 



dx 

within an error functional can provide the necessary dependencies. 

To control and adapt the generalization behavior of learning it seems helpful 
to treat a priori information as explicit as possible. We choose in this paper 
therefore the second possibility of explicit implementation of a priori information 
in terms of the function values h(x). 



4 



Lemm, J. C. / How to Implement A Priori Information 



Furthermore, we consider especially low data situations where the number 
of available training data is small compared to the complexity of h. We do this 
because those are the cases where a priori information becomes the essential 
input. Contrasting the well known uninformative priors of Bayesian statistics ||] 
one may call this a situation with informative priors. 

As a typical problem, consider an image completion task where only some 
of the pixels are given and the complete image has to be completed. If we expect 
for example the image of a face then a priori information to be implemented 
includes the expectation that a face has two eyes, a mouth and a nose and typical 
distances between that constituents. Similar problems are times series predictions 
with a priori informations concerning typically expected relatively well defined 
structures. Informative priors are also useful in object recognition or pattern 
classification when the object is relatively complex compared to the number of 
available training data. For a face detector, for example, a priori information can 
be implemented in form of a crude a priori model of what a face is which is then 
refined by the available training data. 

In practice, a priori information is usually given in qualitative rather than in 
quantitative form. The approach used in this paper assumes a priori information 
stated in form of logical statements like h has (probably) property A AND B (e.g., 
a face has eyes AND mouth) or property A OR B (eyes may be open OR closed) . 
In a first step properties of h are quantified by so called quadratic concepts 
introduced in Section ^. Quadratic concepts consist hereby of a prototype or 
function template t for h, and a corresponding distance measure defining \ \t — 
h\\. In a second step an error functional is constructed implementing the logical 
statements which represent the given a priori information. This is done in Section 
||. An optimal approximation is finally found by minimizing the error functional 
(Section |). 

Combining quadratic concepts by AND (Section |3.1| ) yields well known 
quadratic regularization functionals. Not common is the explicit inclusion of 'con- 
tinuous data' or template functions t representing approximate reference models 
for h. 

OR-like combinations, however, give rise to non-convex error functionals 
going beyond classical regularization approaches (Section [T^). Numerically es- 
pecially interesting are OR-like combinations of quadratic concepts with equal 
distance and only differing template functions. Those can be treated without cal- 
culating normalization factors which are difficult to obtain. This is exemplified 
in Section 3.3. Section [O] discusses an alternative implementation of OR-like 
combinations. 

In case the number of properties combined by OR is too large to be treated 
exactly additional approximations are necessary which are discussed in Section 
[D]. This is for example the case if an expected structure can be arbitrarily trans- 
lated or otherwise continuously transformed. An eye, for example, can appear 
in different shapes/scales/positions and only those variants most consistent with 
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the training data will then be included explicitly. 

The proposed approach provides a interface between symbolic methods and 
statistics^: A priori information is decomposed by symbolic/logical operations 
into components, simple enough to be quantified in form of quadratic concepts. 
Quadratic concepts are then combined, modeling the symbolic operations, result- 
ing in an error functional which is then treated by non-symbolic methods. 

Finally, let us add three remarks concerning the numerical requirements of 
the approach, the language of statistical physics, and the interpretation of error 
functionals. 

Explicit implementation of a priori information is in generally numerically 
extensive. Due to increasing computational resources numerical methods used in 
statistical physics or field theory, e.g., Monte Carlo calculations, become more and 
more applicable to learning problems. Up to now this is mainly the case for one or 
two dimensional problems, like for example in Bayesian image reconstruction 53, 
p2| , |6S| ,|74[| . In particular, the stationarity equations to be solved to minimize the 
presented error functionals can in general be nonlinear inhomogeneous integro- 
differential equations. They are however not necessarily difficult to solve as can 
be seen in Section The reason is that the nonlinearity, in particular the 
number of minima, is under explicit control and there are limiting cases (at 'high 
and low temperatures') where the equations become linear. In any case, the 
presented error functionals can be utilized as a well defined starting point for 
further approximations. 

Due to the background of the author the paper is written mainly in the 
language of statistical mechanics. Of course, there exists alternative formulations 
such as function approximation theory or empirical processes, which might be 
better suited for some purposes. The advantage of using a statistical mechanics 
formulation, however, is that it provides easy contact to approximations (e.g., 
saddle point approximation, perturbation theory, high temperature expansions 
[29,4^,^]) and numerical algorithms (e.g., Monte Carlo algorithms ^,^^1,48]) 
well known from statistical mechanics or statistical field theory and especially 
useful for large systems. In Appendix |A| the language of statistics is related to 
that of statistical mechanics. 

Like regularization approaches in empirical risk minimization we use an er- 
ror functional to find an optimal approximation. We remark, however, that the 
error functional has a different interpretation from the view point of empirical 
risk minimization and that of Bayesian statistics. In empirical risk minimization 
the error functional represents the regularized form of an empirical risk with data 
terms being an empirical estimate of an expected risk. In the Bayesian interpreta- 
tion the error function is related to the negative logarithm of the posterior density 
of h given the training data. Minimizing the error function is then equivalent to 
a maximum a posteriori approximation under log-loss. As we use the Bayesian 
interpretation to built up error functionals the necessary background is provided 
in Appendix |B|. 
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2. Quadratic concepts 

2.1. Definitions 

In this Section quadratic concepts are defined as fundamental building blocks 
to construct regularization functionals. We concentrate on function approxima- 
tion or regression problems. Hereby an unknown function or true state of Nature 
Hn is approximated by a function h chosen from a model space Tt using train- 
ing data Dt = {(xi,Ui)\i < 1 < n} = (x T ,yj,). Training data are assumed to 
consist of n pairs of independent variables x (describing the kind of measure- 
ment performed) and dependent variables y (responses or measured values). In 
classical regularization theory |p4| , pq| the error to be minimized is given by a reg- 
ularization functional E(h) (see Appendix [B] for a Bayesian interpretation) which 
contains the mean-square error terms for training data and an additional prior 
term, in many cases related to smoothness. We denote prior information by Dq 
so D = DtDDq represents all available data we have. Application to density esti- 
mation problems (having an additional normalization constraint and logarithmic 
terms replacing the mean-square data terms) or classification problems (being 
a special case of function approximation with integer y) poses no principal new 
problems and will be reported elsewhere. 

Example 1 (Image completion) Consider an image completion or image recon- 
struction task |)^,^,[3^,[74|] . Hereby an image h is drawn from a set of images 
TL. Then, given noisy observations yi{xi) for some pixels X{ of h a reconstructed 
and completed image y = h{x) € H should be returned by the learning system. 
For grey level images h is a vector of real numbers containing a two dimensional 
ni x ri2 array of grey level values g(k,l), i.e., with j-components hj = g(k,l) 
with 1 < k < m, 1 < I < ri2, 1 < j < n\U2 and, for example, j = k + n\(l — 1). 
Assume now one knows that the image is that of a face. For the class of faces 
experts can contribute information by verbally describing prototypical forms of 
constituents (e.g. eyes, nose, mouth), their variants (e.g. open vs. closed mouth, 
translated, scaled, or deformed eyes) and relations (e.g. typical spatial distances). 
The related concepts like 'eye' or 'typical distance between eyes', are here lin- 
guistic or 'fuzzy' variables ]36| l which must be quantified to enter a regularization 
functional. As prototype or template t of an 'eye' an image or drawing of an eye 
can be chosen. A distance d(t, h) between a reconstructed image h and an 'eye' 
template t can be defined pixel-wise, e.g. by 

d 2 (t,h)=Y / (Hx)-t(x)) 2 , (2) 

X 

or, more generally, by a real symmetric, positive definite kernel 

d 2 K (t, h) = J2( h (x) ~ t(x))K(x, x')(h(x') - t(x')). (3) 
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The sum over pixels x may be restricted to regions where eyes are expected. Fur- 
thermore, eyes may be open OR closed, appear in different sizes and at varying 
locations. Hence an eye is represented not by a single template but by a set of 
templates describing the variants in which an eye can appear. For continuous 
variations we will call such a set a deformable or adaptive template. It is de- 
scribed, for example, by scaling, translation, or deformation parameters. Then 
the image of a face, incomplete and disturbed by noise, should be reconstructed 
by approximating the given noisy data points of the incomplete image AND typi- 
cal constituents of a face and their spatial relations in either one of their variants. 
Thus, a missing eye should be reconstructed at a typically expected position de- 
pending, for example, on the approximated position of mouth and nose, and in 
a form depending on the form of another, possibly visible eye. 

The model treated in Section (|3.3| ) represents a simple example for such a 
task (for a one-dimensional image or time series, respectively). For comparison, 
we discuss shortly a possible use of template functions for classification tasks (not 
treated in this paper): 

Example 2 (Face detection) Consider a face detection task. Hereby the inde- 
pendent variable x is drawn from a set of images Xr by an unknown mechanism 
which should be approximated by a function h(x). Then, given an image x E Xr 
the output y £ {0, 1} of the learning system should indicate whether it is the 
image of a face, e.g. by y = h(x) = 1, or not, e.g. by y = h(x) = 0. In con- 
trast to the previous example now x is a vector of real numbers containing a two 
dimensional array of grey level values. A function template t now represents a 
prototoype for the classification function h. Classification templates tj entering 
the error functional can now, for example, be constructed by reference to image 
templates t x representing prototypical face/non-face input vectors x or parts of 
it. Thus, E(h) = E(h,{tj}) where classification templates tj are defined with 
the help of image templates tj = tj(x,{tf}). For example face templates t x for 
images x can be constructed as described in the previous example. Then, in a 
second step statements like: 'A face has eyes, nose and mouth in either one of 
several variations' are expressed in terms of distances df from image x to image 
templates t x . A classification templates t could for example be a simple threshold 
function responding with t({df }) = 1 if some function of distances {df } to typical 
faces are below a certain threshold. Finally, distances d(h, t) between classifica- 
tion functions h £ H have to be chosen to build an error functional E{h) to be 
minimized. (We refer to Appendix |B.6| for the distinction between specifying a 
reconstruction model, i.e., the probability of face vs. non-face given an image, 
and specifying a generative model, i.e., the probability of an image provided it 
represents a face or non-face.) 

In general, a concept is based upon 
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1. a template t(x) representing a prototype for function h and 

2. a distance d(h, t) measuring the similarity of a function h to the template t. 
For practical reasons, a quadratic distance is especially convenient. Concepts 
with quadratic distances will be called quadratic concepts. 

3. A template t can be restricted to a subset of Xr with the help of a projection 
operator. 

4. Complex concepts are constructed as combination of quadratic concepts. In 
this paper variants of AND and OR-like combinations are used. 

Now we come to the formal definition of a quadratic concept. Let H be 
a Hilbert space of possible hypothesis functions h, and let angular brackets de- 
note scalar products and matrix elements of symmetric operators, e.g. (t\h) = 
jd d xt(x)h(x) and (t\K\h) = Jd d x d d x't{x)K{ x,x')h(x') for c?-dimensional x. 
(We will often write in the following simply J dx also for d-dimensional integrals.) 
Recall also that a real operator K is positive definite (semi positive definite) if it 
can be diagonalized, i.e., in terms of eigenfunctions Q k of K (with transpose 

tf = 5>k*k3fc. (4) 

k 

and all eigenvalues A^ > (X k > 0) are positive (non-negative). Thus, K = 
O t DkO with O orthogonal (i.e., O t O = I with identity / and transpose O t ) 
and diagonal operator Dk > (Dk > 0). Positive definite operators define a 
scalar product by (t\h) K = (t \K\ h). Positive (semi) definite operators K can be 
decomposed K = W T W = (OW) T (OW), with real W, invertible if K positive 
definite, and arbitrary orthogonal O. A quadratic concept is defined as follows: 

Definition 1 (Quadratic concept) A quadratic concept is a pair (t,K) consisting 
of a template function t(x) G TC and a real symmetric, positive semi-definite op- 
erator, the concept operator K with eigenfunctions (features) and eigenvalues 
(feature weights) We call W a concept filter if K = W T W . The operator K 
defines a concept distance: 

d 2 (h) = d 2 K (t,h) = (h-t\K\h-t) = \\h-t\\ 2 K (5) 
= (W(h-t)\W(h-t)) = \\W(h-t)\\ 2 (6) 
= X k (h - t\* k ) {$ k \h - t> = 5>* II {h\* k ) - {t\* k ) || 2 , (7) 

k k 

on subspaces where it is positive definite. The maximal subspace in which the 
positive semi-definite K is positive definite is the concept space Hk of K . The 
corresponding hermitian projector Pk in this subspace Hk is the concept projec- 
tor. 



Lemm, J. C. / How to Implement A Priori Information 



9 



There exists a correspondence between quadratic forms and gaussian pro- 
cesses [58,43,48]. Let x be elements of (the dual of) Tt, i.e., h(x) — t(x) = (x\h — t) 



K~ x x 



K 



h-tj = {K~ x x\h - t) K , is a bounded functional according to the 
Riesz representation theorem. In the context of stochastic processes TL is known 
as reproducing kernel Hilbert space for x with reproducing kernel K^ 1 [p8[| . (The 
term reproducing kernel Hilbert space does not name a certain subclass of Hilbert 
spaces. Indeed, all separable Hilbert spaces are isomorphic for equal dimension. 
It characterizes the representation by the coordinates x. For example, the space 
£2 of square integrable functions h(x) is not a reproducing kernel Hilbert space 
with respect to the coordinates x. This is reflected by the fact that functions 
h[x) £ £2 are not even defined pointwise.) For reproducing Hilbert spaces the 
scalar product with kernel K can be related to a covariance of zero mean gaussian 
variables. Interpreting t as data y D obtained in situation x D = K (i.e., measuring 
features weighted by A&) we write 

P (y D \x D ,h) = P (t\K,h) oce-<W)/ 2 (8) 

for a distribution of t with mean h and covariance operator K~ l . (For a inter- 
pretation as posterior p(h\D) see Section |2^ .) The kernel function K~ 1 (x,x f ) of 
the covariance operator is also known as Green's function of K, propagator, or 
two-point correlation function. Hence, an error or energy functional 

E(k) = (9) 

corresponds up to a constant to a negative log-probability. For approximation 
problems minimizing an error functional E(h) can, from a Bayesian point of view, 
be interpreted as a maximimum-a-posteriori approximation (see Appendix p.6| ). 

We remark that we will not discuss in the following problems of infinite 
spaces. This holds especially for the nonlinear models in the next sections for 
which the question of a continuum limit is highly non-trivial pO , 75 1 . Hence, if 



necessary, integrals can in the following be considered as convenient notation 
for sums. Analogously, derivative operators can be replaced by their discretized 
lattice versions. This reflects also the fact that finally numerical calculations have 
to be done in a finite dimensional space. 

The next two examples show that the definition of a quadratic concept 
includes the standard regularization functionals. The first example is a discrete 
concept with 'trivial' concept distance, the second a continuous concept with 
'trivial' template function. 

Example 3 (Data template) A standard mean-square error term used for re- 
gression is 

(yj - h( Xj )) 2 = f°° d d x 5{x - Xj)(h(x) - tj(x)) 2 = (h - tj Pj h - . (10) 
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Thus, such a mean-square error term corresponds to a quadratic concept with 
concept operator with kernel Pj(x,x') = 8(x — Xj)5(x — x') and (data) template 
tj is the constant function tj(x) = yj. The measured features are h(x) = (h\x). 



Example 4 (Prior template) A typical smoothness functional in regularization 
theory, corresponding to the Wiener measure for stochastic processes l^,20|,32p3] 



and to the kinetic energy or a free massless scalar Euclidean field in physics 



IS 



d d x^<^y =-(h- to \A\h-t ). (ii) 

Here partial integration has been used under the assumption of vanishing bound- 
ary terms. This quadratic concept has the zero function to(x) = as template. It 
is a sum of terms with concept filters d/dx\ which are the generators of infinites- 
imal translations in d dimensions. Hence, the functional represents a measure of 
approximate infinitesimal translational symmetry. The concept operator is the 
negative (semi) definite <i-dimensional laplacian with kernel 

A(x,x')=5(x-x')J2^. (12) 
l=i ox i 



In the following we will mainly be interested in the non-standard case of a 
combination of discrete training data with several prior concepts with non-zero 
template functions. Firstly, we discuss in more detail the two main ingredients 
of a concept: templates and distances. 

2. 2. Templates 

Template functions can be constructed in various ways. The following list 
gives some potential applications of template functions in different contexts: 

1. (Direct construction by experts) A template can be directly constructed by 
experts. For financial time series, for example, often an expected trend is 
included p8fl , 

2. (Combination and extension of arbitrary learning methods) More generally, a 
template t(x) can be the output of an arbitrary, parametric or non-parametric 
learning method trained for the same problem. For example, the prototype 
t(x) can be a rule-based expert system, a regression tree, a neural network or 
a simple linear regression (for a collection and comparison of methods see for 
example ]50[]). B ecause a prior concept can depend on training data Dt (see 
Appendix |B.4.1| ) such a t{x) can be obtained by training with the same data 
Dt which are also used to determine the optimal approximation h* . If one 
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wants t(x) to be independent of the training data Dt it can alternatively be 
obtained by training with an independent set of training data. 

3. (Transfer) The template t(x) can also be the result of a learning algorithm for 
a similar problem and therefore be used to transfer knowledge between tasks. 
Such a transfer template can be adapted and restricted to certain subspaces 
by using concept projectors. The new solution h adapts the transfer template 
t to the new situation according to the new training data and other, additional 
prior templates. 

4. (Learning history) In all cases a template t(x) can be seen to contain in a 
compressed form the learning history prior to the new training data. This 
allows for example to construct on-line learning procedures. Hereby an inter- 
mediate solution h is obtained by using only a part D\ C Dt of the available 
training data Dt- Then the intermediate solution h is chosen as additional 
template t and learning is continued with a new data set D2 C Dt- 

5. (Sampling) A template corresponds to the mean of a gaussian process. For 
a finite number of x it can therefore be approximated by a sample mean, if 
samples for h(x) are available. This can be, for example, a set of complete 
images (or images of constituents) for usage in image reconstruction. Thus, let 
t a denote a sample for (a part of) h(x). Then, if p(h) can be approximated by 
a gaussian, the empirical mean t = J2 a t a is a natural candidate for template. 
Similarly, multimodal distributions p(h) can be approximated by a mixture 
of gaussians. In that case not only one but a set of centers, i.e., templates ti 
has to be chosen (see Section ||). The centers t{ can be obtained by clustering 
methods. 

Clearly, these fields have long publication histories and the advantages and 
disadvantages of using templates still have to be investigated. In this paper we 
concentrate mainly on the first possibility. 

2.3. Covariances and symmetries 

The concept kernel, respectively its inverse the covariance operator, are often 
related to approximate symmetries of a problem. Sometimes, also a finite rank 
approximation can be obtained by sampling. 

Frequently prior information has the form of symmetries which h(x) has to 
fulfil, exactly or approximately. We already have seen in Example ^| that approx- 
imate symmetry under infinitesimal translations is related to smoothness. The 
implementation of approximate symmetries requires the definition of a distance 
to exact symmetry. 

We can write for a positive (semi) definite concept operator K = W T W = 
(I — (W — I)) T (I — (W — I)) with identity /, and concept filter W. Now consider 
operators S which just change the argument x of h(x) into o~(x), i.e., 



Sh(x) = h{a(x)). 



(13) 
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If the transformation &(x) is one-to-one then S permutes the function values and 
S is a symmetry transformation. Hence, for K = (7 — S) T {I — S), i.e., S = W — I 
and a zero template t = the corresponding squared ET-norm 

= (h\K\h) = (h-Sh\h-Sh) = \\h - Sh\\j = d 2 s (h) (14) 

compares h with Sh and measures therefore a degree of symmetry. 

A bit more generally, we call a concept operator K = (I — S^i^s^I — 5) a 
symmetry concept operator, if S is a symmetry operator. Here K$ can be some 
positive (semi-) definite operator usually taken equal to the identity I, and the 
hermitian conjugate S 1 ' is equal to the transpose S T for real matrices. With zero 
template we have 

4 = (h- Sh\K s \h- Sh) = (h\(I - S)^K S (I - S)\hV (15) 

Continuous symmetries are represented by Lie groups which locally can be written 
as exponential of m generators s = (s%, ■ ■ ■ , s m ) parameterized by a vector 9 

S(9) = e^ = J2^(0\s) k . (16) 
fe=o K - 

using the scalar product notation (6\s) = J2j @j s j a ^ so f° r vectors of matrices or 
operators s. The infinitesimal generators s form the corresponding Lie algebra. 
In particular, the group of d-dimensional translations is generated by the gradient 
operator V. This can be verified by recalling the multidimensional Taylor formula 
for expansion of h at x 

S(6)h(x) = e^ v) h(x) = ^Tf- h (x) = h(% + (17) 

Up to first order the expansion of the exponential function reads Sb1| 9s. 
Thus, we can define a distance to an infinitesimal symmetry by 

,9/^ /h-(l + 9s)h „ h-(l + 9s)h\ 

d s(h) = ( — Q K s L_ ^ = (^1^1^), (18) 

with infinitesimal symmetry operator K = s^K s s. For translations and K s equal 
to the identity I this results exactly in (|TT]) . 

In cases where samples t a for h are available, like sometimes in image re- 
construction or in times series prediction, a finite rank approximation K^ 1 of 



K- 1 (x,x') = J dhp(h)h(x)h{x') (19) 
can be obtained by the empirical estimator |55| 

k- 1 (x,x')=J2Ux)ta(x'). (20) 
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An optimal h* in the subspace where K^ 1 is positive definite and therefore in- 
vertible can be found by singular value decomposition. Alternatively, more prior 
concepts with concepts operators K\ can be added so the sum J2 i becomes 
strictly positive definite. 

2-4- From posterior probability to likelihood energy 

In this Section we discuss the interpretation of quadratic concepts and error 
functionals in terms of posterior probability and likelihoods (see also Appendix 
^). We will need this interpretation in the following to combine quadratic con- 
cepts to more complex error functionals. 

Typically, hypotheses h are defined by specifying the probabilities p(y D \x D ,h) 
of finding data y D in situations x D under h. These data generating probabilities 
p(y D \x D , h), or (a^-conditional) likelihoods of h under y D , are related to posterior 
probabilities p(h\D) we are interested in according to Bayes' rule 

p (h\D) = pC-^pW = p(yp\ x D> h )p( h ) pi) 

P(D) p(y D \ 

X D ) 

with 

p{Vd\ x d) = I dhp(y D \x D ,h)p(h), (22) 
Jn 

assuming p(h\x D ) = p(h) and shorthand notation 



[ dh--- = f Y[dh(x)--- = f dh(xi) I dh(x 2 ) ■ ■ ■ = [n [ dh(x)\ ■ ■ ■ . 

(23) 

The terms of Eq.(|2l|) are in the Bayesian context often referred to as 

likelihood * prior 

posterior = . (24) 

evidence 

For complete data which means for uniform (possibly improper, i.e., non- 
normalizable) p(/i)f] yields (see also Appendices B.1.1B~4~T| identifying h with h 
according to Appendix |~~ 



Being interested in the dependency on h for given data D we can skip the h— 
independent factor p{D) or J dhp(D\h) and calculate instead of p(h\D) the inverse 
p(D\h). For the special case (||) of one quadratic concept with template t, so 

1 The probability p(h) is known as prior probability of h. In this paper we treat prior information 
explicitly and analogous to standard training data. That means we assume p(h) to be fa- 
independent and collect all prior information in Do- They enter the formalism through p(h\Do) 
or p(D \h), respectively. 
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y D = t and x D corresponds to K, we find for the denominator in ( pq) because of 
the translational invariance of the gaussian measure 

, fdhe~ d2 / 2 fdhe- : t( h - t \ K \ h - t ) 

Z L = dh P (t\K, h) = jdh 6 _ d 2 = Jdhe = 1, (26) 

J Jdte d I 2 j dte -±(h-t\K\h-t) 

and thus p(h\D) =p(y D \x D ,h). 

To obtain error functionals we will now express probabilities in terms of 
energies and related quantities (see Appendix A.1.2 ). For example, the data 
generating probabilities or (x D -conditional) likelihoods of h can be written in 
terms of (x D -conditional) likelihood energies E(y D \x D ,h) and inverse likelihood 
temperature j3 L 

p(y»K,h) = e '"f^ Xh " = e -eMv D K*)-ny»\* D , h )) , (27) 

■6{yD\x D ,ri) 

and analogously the posterior probability p(h\D) becomes in terms of posterior 
energy E(h\D) and inverse posterior temperature (5 P 

p -/3 p E(h\D) . s 

mD) = z(h\d) = e-^m-nn\ D) ) (28) 

or in terms of likelihoods 

p{h) 1 e~ P L E{ - V D\ X D' h ) 

= e ~ p L ( E (y D \x D ,h)-F(y D \x D ,h)) +c L (h,D) ^ ^ 29 -j 

with 

Z L (h,D) = P ^ D \ c L (h,D)=lnp(h)-lnp(y D \x D ), (30) 
free energies, 

F(H\D) = --^-ln Z(H\D), F(y D \x D , h) = ~ In Z(y D \x D , h), (31) 
Pp Pl 

and normalization factors or partition sums 

Z{H\D) = [ dhe~Pp E W D \ Z(y D \x D ,h)= f dy D e-^ E (v D K> h ). (32) 
Jh Jy D 

Thus, the posterior energy can be expressed by the likelihood energy 

E(h\D) = F(H\D) + ^(E(y D \x D ,h) - F(y D \x D ,h)) + jc L (h,D). (33) 
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For complete data, i.e., /i-independent p(h) 

r r e ~PL E (y D \ x D> h ) 

Z L (h,D) = Z L (D) = J dhp(y D \x D ,h) = J dh . (34) 

In that case we have already seen that maximizing the likelihood p(y D \x D ,h) 
with respect to h is equivalent to maximizing the posterior p(h\D). However, 
minimizing the posterior energy is not necessarily equivalent to minimizing the 
(conditional) likelihood energy. This is due to the possibility of a /i-dependent 
normalization of the likelihood energy. If, however, in addition to complete data 
also the likelihood normalization is /i-independent, i.e., Z(yr)\x D , h) = Z(yr)\x D ), 
then 

= jdhe-^v B \-nM = Z(y D \x D ,H) 

L z(y D \x D ) z(y D \x D ) { > 

and the posterior becomes according to Eq.(p9|) 

-P L E(y D \x D ,h) 

p(h\D) = e —— — oc e -P L E(y D \* D ,h) : 36 

Z{y D \x D ,H) 

where Z(y D \x D ,7i) = Jdhe~^ E (yo\ x D' h ) is an integral over the conditional vari- 
able h. Thus, for complete data and h-independent likelihood normalization max- 
imizing the posterior is equivalent to minimizing the (x D -conditional) likelihood 
energy. 

Remark 1 (Mixed representation by likelihood and posterior energies) 

Up to here we expressed posterior energies completely by likelihood energies. 
It is often also useful, however, to express the posterior energy by a sum of 
likelihood energy and posterior energy terms. A prior term for example, may 
describe a h (and not data) generating process directly given by p(h\Do). Let the 
data D = (x D ,y D ) be therefore divided in measured data Di = (xl,Pl) which 
enter as likelihoods and a generating prior Dp = (x p ,y p ) describing the posterior 
of the /i-generating process. Thus, D = D^UDp and x D = XiUx p , y D = yi\Jy p . 
Then, according to Bayes' theorem and p(h\xi, Dp) = p{h\Dp) 

p{y L \x L ,h)p{h\D P ) _ e -t3 L E{y L \x L ,h)-l3 p E(h\D P )+lnp(y L \x L ,D P ) 

P{ 1 } ~ p(y L \x L ,D P ) ~ Z(Y L \x L ,h)Z(H\Dp) ' (3 } 

In this case, minimizing the mixed energy f3 L E(yi,\xL, h) + (3 p E(h\Dp) is equiv- 
alent to maximizing the posterior p(h\D) if the normalization of the likelihood 
terms Z(Yz j \xl, h) is /i-independent. 



3. Combinations of quadratic concepts 
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3.1. AND: Classical regularization and gaussian processes 

In the classical situation of regularization theory the aim is to approximate 
all available data, training data Dt as well as prior information Dq. In logical 
terms, the aim is to approximate Dq AND Dt x AND Dt 2 AND • • • D^ n - A logical 
AND of events corresponds to a product for probabilities p(A, B) = P{A)p{B\A) 
or p(A, B) = P(A)p(B) for independent events. A product for probabilities 
corresponds to a sum for log-probabilities. This holds also for concepts, if we 
interpret concepts as linear functions of log-probabilities, i.e., as energies (see 
Appendix |A|) or errors (see Appendix BJ3). In the following we will identify the 



events A, B, etc. with data in the form of template functions tj. 

Consider a set of quadratic concepts d? with templates T/v = {tj\l < j < N} 
and concept operators Kn = {Kj\l < j < N}. Assume we have data y D = t\ 
AND t 2 AND i.e., 

P (h\D)= P (h\T N ,K N )<xp(y D \x D ,h)=p(T N \K N ,h) (38) 

with E(T N \K,h) = E{y D \x D ,h) = EjE(tj\Kj,h) = £f d){h)j2. In writing 
p(tj\Kj, h) we suppressed the dependency on the temperature 1/P L . The normal- 
ization factor Z(T n \Kn,Ii) factorizes 

N 

Z(T N \K N ,h)=l[Z(T\K j ,h) (40) 

3 

with 

Z(T\K j7 h) = J njdtjix)^ e -Wi\K h h)_ (41) 

For quadratic concepts the integrals are gaussian and therefore independent of 
the mean h, i.e., 

Z(T N \K N ,h) = Z(T N \K N ). (42) 

Thus, according to the results of the last Section we may minimize the likeli- 
hood energy E(y D \x D , h) = E(Tn\Kn , h) instead of the posterior energy E(h\D). 
Writing now for simplicity E(T^\K^,h) = E(h), the following proposition is 
obtained by straightforward calculation: 
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Proposition 1 (Probabilistic AND) A sum of squared distances Ej = d 2 (h)/2 = 
d(tj, h)/2 can be written 



N 



1 N -j N 

E(h) = ]T Ej(h) = - ^ dpj, h ) = -J2(h-t 3 

3 3 3 



K, 



h-t 



N 



{h\K\h) + J2(tj K 3 tM-lhy^Kjt 



= \({h-t\K\h-t)+Y J (tj \Kj\ tj) - (t \K\ t) 



d 2 (0,h) ni . AT 



N / 

= Y (d 2 (t,h) + V 



d 2 (t,h) 



E n 



(43) 



as sum of one squared distance d 2 (t,h) from template average t and a h 
independent minimal energy E m i n . Squared distances axe defined as 

d 2 (0,h) = (h\K\h), 



d 2 (t,h) = (h-t\K\h-t), 



(44) 
(45) 



(P(t,h) 



d 2 (t,h) 
N 



(h-t\K\h-t) , 



with template average 



TV 



t = K~H, t = J2 Kjt. 



3 ' 



(46) 



(47) 



concept operators 



3 3 

and /i-independent minimal energy and template variance V 



with 



N o 
Emin(T N ) = y V, V = M 2 - M 2 , 



1 N 

M 2 = ^E(^N^'}> M 2 = (t\K\t) 



(48) 



(49) 



(50) 
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The linear stationarity equation for a functional E = d 2 (t,h)/2 + E m i n reads 

= K(t-h) =t- Kh. (51) 
For positive definite, i.e., invertible K, this has solution h = t = K~ l t. 

We see that for quadratic concepts addition does not lead to something really 
new: The sum of quadratic functions is a quadratic function, or in a probabilistic 
interpretation, a product of gaussians is a gaussian. It is also interesting to note 
that for such additive combinations all template functions ij corresponding to 
infinite data can be eliminated from the formalism by combining them in one term 
with template average t and solving for a shifted h' = h — t. This elimination 
of 'infinite data' templates will not be possible in the nonlinear combinations 
presented in the next sections. 

Remark 2 (Normalization) The normalization factor Z(T N \K^,h) to obtain 
p(Tn\Kn ,h) being a product of ^-dimensional gaussian integrals can be calcu- 



lated explicitly (see Eq.( 312 )) 



Z(T N \K N ,h) = n/ (n^(z)j e-+ d ?^ (52) 
N / N \ ~2 

= 11 (dettf,-)-*) =7T 9 -I3 L ~^ ( detail • (53) 

For gaussian integrals the exact solution (1531) coincides with a saddle point ap- 



proximation (see |B.5[) . Eq.(p3[) differs from a T-dependent normalization over 
h £ TC which would be necessary to obtain p(h\D) if we would interpret a sum of 
quadratic concepts as posterior energy E{h\D) = J2j E(h\tj, Kj) = J2jdj/2, 



ZCH\D)=Jdhe-^ E ^=J (n^le-^^ 



_ i 

N \ 2 



e 



-PpE min J dhe -p p ±Mh± = e -P P E mi n n ip p -i h et J2Kjj . (54) 

Denoting by (• • •) a gaussian average over h with covariance oc K~ l = Kj) -1 
we recognize the moment generating function for h(x) 

M{(3 p t) = e Pp E ™ n Z(H\D) = (ePp(*\ h ) 



Lemm, J. C. / How to Implement A Priori Information 



19 



E 



„ nl 



dxi--- dx n t(xi) ■ ■ ■ t(x n ) (h(xi) ■ ■ ■ h(x n )) . 



(55) 



Hence (functional) derivatives of M((5 p i) with respect to (5 P i(x) generate moments 
of h(x) (see Section [A, 3. 3 , in particular Wick's theorem |180| ), 



Remark 3 (Kernel methods) Practically important is the case where the sta- 
tionarity equation can be solved in a space with dimension h < n being the 
number of different x values in the training data. This can be much less then 
the space necessary for a reasonable discretization of the whole function h. To 
see this we consider the classical situation of mean-square data terms and one 
additional prior concept 

E(h) = (h-t T \K T \ h-t T ) + {h- t \K Q \ h - t ) (56) 

with 

n n 

(h - t T \K T \ h-t T ) + V = J2{h-t j \Kj\ h - tj) = £ ( Vj - h( Xj )) 2 . (57) 

j j 

Thus, the operator Kt = J2j Kj 1S diagonal, commutes with the projector Pt 
into the space Xt of training data and has matrix elements 

n n SI — \ 

KT(x,x')=5(x-x')^2^-x J ) = 5(x-x')5(0)n x , n x = ^ ° (X ~* j) , (58) 

containing, for discrete x where 5(0) = <5o,o = 1 represents a Kronecker-<5, the 
number n x of data for every x. For continuous x the factor 5(0) becomes infinite 
but will cancel in the following calculations. The data template 

,„ Y% 5(x - xj)yi(xj) 1 ^ 

tT = ^Kjtj, t T (x) = ^ ! » '7 3 =-Y,V3^) (59) 

is the average of y values for given x. Here K^ 1 is defined on Xt and tx(x) = 
for x not in the training data, i.e, with n x = 0. The stationarity equation 

= K T (h - t T ) + K (h - t ) 

yields 

h = K^a + t , (60) 

with a = Kx(tT — h). For known Kq 1 it is sufficient to solve 

a=[I + K T K 1 }- 1 [K T (tT-to)} (61) 

for a function defined on the space Xt with dimension TrP^ = n < n. The 
formulas can be adapted to the case where Kq has zero eigenvalues so it is not 
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Figure 1. Gaussian vs. two 'robust' gaussian mixtures: Left: Logarithm of one gaussian 
(parabola), Middle: Logarithm of mixture with two gaussians with different variance and equal 
mean (insensitive for large deviations), Right: Logarithm of mixture with three gaussians with 

different variances and different means. 




Figure 2. Three robust error functions which are insensitive to small errors. Left: Logarithm 
of mixture with two gaussians with equal variance and different means, Middle: Logarithm of 
mixture with f f gaussians with equal variance and different means, Right: e-insensitive error. 



invertible over the whole space [68]. Classical examples of such kernel methods 
are gaussian Radial Basis Functions which use as concept operator the pseudo- 
differential operator 

oo 2m 

^o=E(-l) m ^A"\ (62) 



m=0 



where A m is the m-iterated laplacian and which has a gaussian inverse. They 
also include piecewise linear interpolation (Kq = —A) and various spline methods 
(e.g., K = A 2 ) HME1. 



Remark 4 (Robust error functions and support vector machine) As a general- 
ization of quadratic concepts one can allow nonquadratic functions U of filtered 
differences W(h — t) |74j]. In the simplest case U is only applied to the data 
concept (|57| ) of a functional of the classical form (|56|). Then the nonlinear sta- 
tionarity equation can also be restricted to the n-dimensional space Xt- The 
data term can for example be replaced by a gaussian mixture model (see Section 
j3^ ). Robust error functions have flat regions and are therefore insensitive to, i.e., 
robust against, changes in that flat region. Also numerically flat regions in the 
error surface can be useful because there the gradient vanishes and those regions 
do not contribute to the stationarity equations. Robust error functions can for 
example down- weight large errors. Typical cases are filters used in image pro- 
cessing where edges represent large discontinuities (regions with low smoothness) 
but are relatively likely. Fig.[l| shows examples obtained by gaussian mixtures. 
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Another variant of robust error functions is insensitive to small errors. An exam- 
ple is an e-insensitive error (zero between ±e, linear outside, see Fig.Q) for the 
data term. For example, expanding h in a basis of eigenfunctions <]?/% of the prior 
concept operator 

K = h$k$L h(x) = J2 n k$k(x) (63) 

k k 

yields for functional ( |5G| ) 

E ( h ) =2 (l>fc*k(3t) - + J2 X k\ n k\ 2 - (64) 

i \ k ) k 

Replacing the mean-square error data term by an e-insensitive error results in a 
standard quadratic programming problem and is equivalent to Vapnik's support 
vector machine |67, 24, 62,63]. 

3.2. OR: Mixture models 

Assume we believe that a function can be similar to either one of two tem- 
plates. Often a list of possible alternatives can be given. For example, one may 
simply state that a face is that of a women or that of a man, eyes may usually be 
open or closed, or we may be able to give a list of possible pattern prototypes for 
a time series. For example, electrocardiograms, or similarly earthquakes, have 
typical patterns which can appear in distinguishable variants. Thus, we discuss 
here the case where we want to approximate t\ OR t2 OR ■ ■ ■ t n . For probabilities 
we have p(AOHB) = P{A) + p{B) — P(A,B) where the last term vanishes for 
exclusive events. For log-probabilities L of exclusive events this means 

L(AORB) = In (e L ^ + e L ^) . (65) 

Relating now concepts to log-probabilities by interpreting them as energies or 
errors, respectively, shows that alternative quadratic concepts lead to gaussian 
mixture models. 

To be specific, let us discuss the cases of discrete input, output and gener- 
ation noise (See Section B.2.4j ). Input noise means that the measurement device 



producing outcome y is not known exactly, i.e., 



p(y\x,h) = ^2p(y\xi,h)p(xi\x). (66) 

i 



For gaussian p(y\xi, h) hypothesis h defines the mean and the independent vari- 
able Xi defines the covariance K~ l / j3 L of the measurement producing y. Thus, 
model (|66|) leads to a mixture of gaussians with different covariances. Similarly, 
under output noise a measured value cannot be read off exactly, 

p(y\x,h) = J2p{y\yi)p(yi\x,h). (67) 
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Thus having found y the true measurement result was one of the yi. For gaus- 
sian p(yi\x,h) where yi is represented by a template function ij this leads to a 
likelihood for h being a mixture of gaussians with different means. In particular, 
consider a situation of ambiguous data where a set of yi cannot be distinguished 
by a measurement procedure so all yi lead to outcome y. In that case 

1 * 

p{y\yi) = w X 5 (y-yi)> ( 68 ) 

1 i 

so the likelihood becomes a simple sum over alternatives 

1 Ni 

p{y\x, h) = — ^2p(yi\x, h). (69) 
iV * i 

Generation noise on the other hand means that the /i-producing probability is 
modeled as a mixture 

P(h) = Y,p(i)p(h\i), (70) 

i 

where i could determine mean and covariance of a gaussian p(h\i). 

For a maximum posterior approximation we have to maximize the posterior 
density 

P{ h\D) = Pi h\D L , d p) = pMp^mM . (71 ) 

Including input, output, and generation noise the posterior becomes 

p{h\D) oc X)p(fc)p(/»|fc)p(j/|i)pO»P(»li. h), (72) 

ijk 

skipping the /i-independent factor p(y L \x L , Dp). We will especially consider the 
case where p(i\j, h) = ]\^ L p(ii\ji, h) factorizes into Nl independent components 

p(h\D) oc Y;P( k )p( h \ k )p(y\^j\x) \[p(il\3h h), (73) 

ijk I 

with i = {ii\0 < I < Nl} and j = {ji\0 < I < Nl}. In particular, if also input 
and output noise factorize, i.e., p(y\i)p(j\x) = Yli p(yi\ii)p(ji\xi), this would read 

p(h\D)<x^2 p( fc M^I^)Il^(^I^M^Ii«>%(i«k«) 

V H'ii' k 1 

= X]^( fc M^I^)IlX]^(y'I^M^|j«^M^k«)- ( 74 ) 

k I iiji 
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Choosing now quadratic concepts for /i-dependent (likelihood and generation) 
energies, Eq.(^) becomes 

p(h\D) oc p(k,m)p(h\K k ,t m )p(T\i)p(j\K)l[p(t il \K n ,h), (75) 

ijkm I 

where T = {i/|0 < I < Nl}, K = {Ki\0 < I < Nl}, and separate summa- 
tion variables for mean and covariance of the /i-generating process have been 
introduced. 

To find an error functional to be minimized we express the posterior in terms 
of energies and free energies, the latter determined by normalization factors. Two 
kinds of free energies have to be included in an error functional. 

1. Free energies depending on variables for which we want to maximize the poste- 
rior, may not be skipped. Thus, in general free energies of all likelihood terms 
have to be included. For the special case of gaussians, however, normalization 
of likelihood terms is /i-independent. 

2. Furthermore, it is often easier and more meaningful to specify instead of a 
joint probability, e.g., p(h,k), conditional and marginal probability, e.g., p{k) 
and p(h\k). In that case also free energies depending on summation variables 
i, j, k, m have to be included. Otherwise, such free energies would contribute 
to marginal energies like p(k), and terms like E{k) could not be interpreted as 
energy for marginal p(k). Systems specified by conditional energies are also 
known as disordered systems (see Section |A.4|) . 

Thus, in terms of energies, 

p{h\D) oc J2 e" ftE(fc) e"' ? f (E( ' l| ' !) " f(f<|t)) e" , ' ! ' (E(! ' |,) " F(j;|i)) 

ijk 

xe -4 Y,SE{n\juh)-F{X l \j l ,h)) e -(3 x {E{j\x)-F(J\x))^ ^ 

for %i e li, j e J, y e y, h € TL. In particular for quadratic concepts this 
becomes, choosing [3 P = P L = (3, 

ijkm 

(77) 

with pij oc p{T\i)p(j\K). Notice that despite p(T\i) is not normalized over i 
we could nevertheless choose p^ to be normalized over i and j by taking p^ = 
p(T\i)p(j\K) / J2iP(T\i)- For k- and j-independent covariances this becomes 

p(h\D) oc £ p(t m ) pi e-m^M+Ei \Ki,h)) (78) 

im 

with pi oc Y\i p(T\i) . The posterior can be written completely in likelihood form by 
using that for quadratic concepts the free energy is /i-independent F(H\Kk,t m ) = 
F(T m \Kk, h) and p(h\Kp, t) = p(t\Kp, h) under uniform prior. The /i-generating 
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energy E(h\Kp,t m ) can therefore be included in the sum over I. Hence, combining 
i, j, k, m into one multi-index i and writing t; Ll = tn, = Ku Eq.(|77|) reads 

p(h\D) OC ^ p . e -^ + \mu\K jl ,h)-F(T u \K jl ,h))_ (79) 

i 

Hence, we obtain: 

Theorem 1 (Mixture model) Alternative quadratic concepts can be imple- 
mented by the mixture model 



/ N 

E M (h) = -\np(h\D) = - In ( e 



N 



(80) 



where the component energies 

Ni Ni 

Ei(h) = J2 E{tij\K ih h)=J2 Eijih) (81) 

3 j 

are additive combinations of quadratic concepts 

Eij (h) = , 4 (h) = (h - Uj | Kij \h-tij)- (82) 

If z-dependent the normalization integrals 

a = - X> ( J dt ije -P E A + c, (83) 
j 

have to be calculated up to an arbitrary constant c so they do not interfere with 
mixture probabilities p^. The model has the stationarity condition^] 

= t M (h)-K M (h)h, (84) 

with 

K M = Y.< h ) K ^ K i = Y. K ^ a i (h)=p i e-( 3E *W +c t (85) 
* j 

and 

t M = a i(h)Kiti = Y >h(h)ii, ti = K~Hi, U = Y1 K i3 t ij- ( 86 ) 



Proof: The form of Em follows directly from Eq, (|79|) . The stationary equa- 
tion is obtained by straightforward calculation. q.e.d. 

2 If /i-dependent the a = Ci(h) additional terms arising from Sd/Sh would contribute to the 
stationary equations. 
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It is in general nonlinear and can have multiple solutions. Indeed, the model 
( |80| ) is in contrast to classical regularization functionals in general non-convex. 
The mixture model energy Em has with respect to the summation variable i 
the form of a free energy for a system at finite temperature. Notice, also, the 
difference to mixture models like they are used frequently in density estimation. 
In such approaches h is assumed as gaussian mixture. In contrast, model ( p0| ) 
represents a mixture model for the posterior density and does not restrict h to 
be a gaussian mixture. 

Remark 5 (Separate saddle point approximation for each component) Alterna- 
tively, a maximum posterior approximation (which would be exact for gaussian 
integrals) can also be applied to the i components separately. Then, however, in 
general a second minimization step has to be performed also for approximation 
problems (see Appendix jB.6|) . Notice that also for separate saddle point approx- 
imations weighting factors (det i'Q) -1 / 2 have to be calculated in case of unequal 
component covariances (see [T^JT^Ji^l and Appendix |B.5.1 ). 



Remark 6 (Low and high temperature limits) In the interpretation of error (or 
energy) minimization as saddle point approximation of a Bayesian risk (see Ap- 
pendix ||) the result becomes exact for zero temperatures, provided that only 
one dominating stationary point survives at zero temperature (see Appendix 
B.5). Thus we expect error minimization to be good at low enough temperature, 



i.e., large /3. At low temperatures, however, the stationarity equations generally 
have many solutions, making it difficult to find the dominating one. (For a more 
detailed discussion of stable low temperature solutions for the mixture model see 
[f42|). In practice one often uses annealing methods which start solving the sta- 
tionarity equations at high temperature and iteratively adapt the found solution 
to lower and lower temperatures (see Appendix |A.3.2 ). Interestingly, a saddle 



point approximation is for gaussian functions also exact at arbitrary tempera- 
tures 1//3. Figures 17 and 18 in Appendix |B| show that for high temperatures 



a sum of two gaussians with different centers becomes approximately a gaussian 
again. Thus, in that case a saddle point approximation is also a good approx- 
imation at high temperatures. More generally, one may perform a (moment or 
high temperature) expansion of the exponential in powers of (3, 

J2 Pi e-^ = (e-^) i = (l-m + ^Ef + ..^ . (87) 
Thus, at high enough temperature minimizing Em = —(H — In (e~^ Ei ) with %— 



independent c« becomes minimizing {Ei) = J2iPi E%. This is just the AND-case 
of Proposition |l]. At medium temperatures larger differences to a full Bayesian 
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approach have to be expected. (Especially at 'phase transitions' where solutions 
of the stationarity equations vary strongly with temperature.) 

3.3. An example with ambiguous prior 

Consider the following situation with ambiguous prior: Assume we want to 
implement that a function can be similar to prototype t a OR another prototype tb- 
Thus, we take the two prototypes as template functions t a (x) and tb(x). Assume 
further, we choose the same concept distances d a (h,t a ), db(h,tb) by taking for 
both templates the same concept operator K a $ = K^fl = Kq. In the following 
we consider a smoothness related sum of iterated laplacians 

^o = EM-Ay, (88) 
I 

where we understand (—A) = /. For equal mixture probabilities p(l) = p(2) we 
obtain for @ 

pM = pf + pf oc eT E * = e~^ + e~^ = e "f + e -§(<^), (89) 

with mean-square data concept d 2 D = (h — tx \ Kt \ h — tx) as in Eq.(|57|), d\ = 
(h — t a \K \ h — t a ) and analogously for b. The stationarity condition for Eq.(|89"D 
is 

h = pft l +pft 2 , (90) 

with component template averages 

h = (K T + Kq)- 1 (K T t T + K t a ) , t 2 = (K T + Ko)- 1 {K T t T + K t b ) . (91) 

Because pf 1 and p^ are not functions but only two temperature dependent convex 
mixing coefficients the solutions for arbitrary temperatures have to be on the one 
dimensional line spanned by the two single components solutions t\ and t 2 . The 
stationarity condition ( |90| ) can be rewritten 

h = t + tanh (£(122 - Eh)\ (92) 

with total template average 

t = (K T + 2K )- 1 (K T t T + K (t a + t b )) . (93) 

In the one-dimensional space of solutions spanned by the two component av- 
erages, the equation is equivalent to that of the celebrated ferromagnet. Thus, 
the two template mixture model shows the typical ferromagnetic bifurcation (re- 
lated to a 'phase transition' of the underlying physical system) switching with 
decreasing temperature from one to two stable solutions. 
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Figure 3. Example with ambiguous priors. The upper left diagram shows the two templates 
t a and tb and data Dt drawn from the interval [0, 30]. The upper right diagram shows the true 
state of Nature /ijv (thick line) in comparison with t a - The second row shows two solutions ti 
and t2 for vanishing smoothness coefficients Ai = 1, Aj = 0, i > 1. 

Fig.|I] shows a numerical example of the two-template situation with 

/ \ /3?r(a;-l)\ . . 2 /3vr(x-l)\ . 

t a {x) = -sm[ — +ai, t b (x)=sm — + a 2 , (94) 

\ m — 1 J \ m — 1 / 

with to = 40 and the constants aj adjusted so the mean over the interval [0, 40] 
becomes zero (see Fig.[|). The shown results have been obtained by the EM 
(expectation-maximization) algorithm (see Section ||). 

Fig.|5| summarizes the temperature dependence of the model. While the two 
low temperature limits j3 — > oo are given by the single component solutions ti 
and t2, the high temperature limit (3 — > is the total template average t. All 
three solutions t\, t<± and t correspond to a quadratic minimization problem (or 
a gaussian process) with therefore linear stationarity equation. 

The fact that the solutions for arbitrary temperatures are in the convex hull 
spanned by the low temperature solutions ti is easily generalized to more then 
two alternative templates with equal covariances. This is a numerical important 
observation, because the low (and high) temperature limits are for quadratic 
concepts determined by linear equations. As explained in Section [O] those linear 
equations can be solved in a n < n dimensional space given by the training 
data Dt- If the number of components is small the optimal mixture of low 
temperature solutions ti may then for example be obtained by cross-validation 
or similar techniques. Then it is not necessary to discretize the whole function 
h which is costly or even practically impossible for higher dimensional x. This 
makes such calculations feasible also for higher dimensional Xr spaces. 
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Data and templat- 




10 20 30 40 50 0.85 



Figure 4. Mixture model (p 



) + ai and tb(x) — sin 



near a bifurcation. 

2 ( 3tt(j:-1) 1 



The chosen templates are t a (x) = 
-) + a2 with m — 40 and the constants ai ad- 
justed to obtain zero mean over the interval [0,40]. 15 data points have been sampled on the 
interval [0, 30] from a function similar to t a - The value /3 = 0.105 is near the critical value shortly 
before the second solution occurs. The parameters for Kq of Eq.(^) are Ao = 0.1, Ai = a 2 Ao, 
A2 = a 4 Ao, Xi — 0, i > 2, with a — (m — l)/(37r). Row 1. Left: Shown are the templates t a , 
tb, (dashed) and the training data (dots). Right: The solution in the high temperature limit t 
(template average of data with t a and tb) /3 — > (dashes with dots) and the two low temperature 
solutions ti, t2 which average the data with one of the templates t a or tb, respectively. Row 
2. Left: The evolving solution during iteration with A — Km and relaxation factor r\ — 1 (see 
Section ^) starting from 'wrong' initial guess h{x) = tb{x). Right: The final solution compared 
with the high temperature and the two low temperature solutions. Row 3. Left: Shown are the 
squared distances of the solution h to the low and high temperature solutions t\,ti,t during 

ti — ti (thick) and 
= 1 (thin) meaning 



iteration: n t .(h) = (h — ti \Kt + Ko\ h — U) /d (tijta), for U = t\ (dashes), 



ti — t(dashes with dots)) . After one iteration step the sum n tl (h) 



that the solution moves along the one dimensional line connecting the two low temperature 
limits ti and ti. Right: The negative error —EM(h) during iteration. 



3.4- OR: Landau- Ginzburg regularization and interacting systems 

The interpretation of quadratic concepts as energies is not the only possi- 
bility. Their quantitative relation to probabilities may be different. For example, 
combination of concepts may also be modeled by fuzzy logical operations. Those 
exist in many variations but coincide on their boundaries with Boolean opera- 
tions |p5| , p6| . We choose for concept distances d(h,t) the logical interpretation of 
d 2 (h,t) = 0, i.e., h = t, as 'true' and of d 2 (h,t) = 00 as 'false'. For such variables 
a typical implementation of a logical OR is a product 

d 2 (AORB) = d\d 2 B . (95) 
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high temperature 
total template 
average 



low temperature: 
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Template 1 
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Figure 5. Scheme of temperature (/3 _1 ) dependence of solutions for the error functional E M of 
model (|89[). The two prior templates ti (template 1 = t a , template 2 = tt) and the training 
data Dt (dots) are shown on the right hand side. At low temperature two solutions exist, 
being the template average of the data Dt with either one of the templates ti or t2- Going 
to higher temperatures the less well fitting solution disappears at a critical temperature 1//3*. 
The better fitting solution survives and transforms for higher temperatures slowly into the high 
temperature solution which is the template average of all three templates: data Dt, ii and ti. 



A product implementation for alternative concepts is especially convenient be- 
cause then arbitrary combinations of quadratic concepts by (an additive) AND 
and (a multiplicative) OR are polynomial expressions in the concept distances. 
The stationarity equations of such polynomial models are easily obtained by set- 
ting the functional derivatives with respect to h to zero and given in the following 
theorem: 



Theorem 2 (Landau-Ginzburg regularization) The polynomial model 

^~EII (a* 4) = \ E P N Lt l n 4. ( 96 ) 

^ LG i=l j=l i=l 3=1 

has stationarity equation 

K LG (h)h = t LG (h). (97) 

Here 

N Ni 

K LG {h) = E E M v ( h ) K H ( 98 ) 

» 3 
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and 

N Ni 

M^EE^i^i- (") 

i 3 

with 

Ni 

M lJ (h)=P^- 1 Y[d^(h), (100) 

Mi 

and 

h-tij). (101) 

Proof: The stationary equation follows using the product rule for derivatives. 
q.e.d. 

In the high temperature limit (5lg ~^ only quadratic terms survive and 
the stationarity equation becomes linear. The polynomial model Elg resembles 
the phenomenological Landau-Ginzburg treatment of phase transitions in physics 

MM- 

Remark 7 (Mixed likelihood and posterior interpretation of error/energy) A 
product Ylj d 2 of quadratic concepts or log-probabilities does not implement a 
probabilistic OR for exclusive events i with gaussian probability according to 
d 2 . But like for Em of the mixture model, one may also try to interpret ad- 
ditive parts Ei of Elg = Y^i^il '(%Plg) as probabilistic AND for independent 
events. Hereby only terms Ej depending on only one template {tij} = {ij} can 
be interpreted as likelihood energies Ei(yi\xi, h). Indeed, for an error Elg of 
form(|96|), being composed out of squared distances dfj(h,tij), the normaliza- 
tion Z(yi\xi,h) = jdyie~^L Ei( -y i \ Xi ^ h ) for additive parts E± depending on only 
one t{ is h— and ^-independent. This can be seen using d 2 (h,t) = d 2 (h — t) = 
d 2 (t — h) = d 2 (t,h), so that also arbitrary functions g(d 2 ) of squared distances 
fulfil g(h,t) = g(d 2 j(h,t)) = g(h — t) = g(t — h). For such functions Jdh g(h — t) = 
/ dt g(t—h) = Jdt g{h—t) for unrestricted or periodic domain of integration. Thus, 
an integral Z(yrj\x D , h) depending on only one t is both, h- and t-independent. 
For terms Ei, however, depending on different tij, like for example a product d\d?,, 
this is not true in general. Within a probabilistic interpretation such terms must 
either be interpreted as posterior energy, or the h (and possibly ^-dependent 
logarithm of the normalization constant lnZ(yr)\x D ,h) has to be subtracted. 



dUh) 



h - U 



Ki 



Example 5 (The two concept model again) The two-template example of Sec- 
tion 3.3 may alternatively be parameterized as follows: 



Elg = 4 + Plg44- 



(102) 
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The stationarity equation is cubic with either one or two stable solutions depend- 
ing on /3lg- Variations are 

E LG = (d 2 D + d 2 a )(d 2 D +d 2 ), (103) 



or 



Elg = d 2 D + d 2 a + d 2 b + (3 LG (d 2 D + d 2 a df) . (104) 

In the latter formulation 0lg interpolates between OR at low temperatures and 
AND at high temperatures, similar to the situation for mixture models. For 



numerical results of the Landau-Ginzburg model for the example of Section 3.3 



sec 42 . 



3.5. OR: Combination of methods and continuous transformations 

The number of components i of a mixture model J2iPi e ~^ Ei+Ci can be too 
large to be treated exactly. Consider a set of template functions t(9), with func- 
tion values denoted t(x,9), parameterized by a continuous parameter (vector) 
9 6 @. This results in a continuous mixture model Jd9 p(9)e~P E ^ +c ^ where 
the sum J2i 1S replaced by an integral / dO . Thus, analogous to the Bayesian 
integral over h (or h, see Appendix |B|), the ^-integral has to be solved by an 
approximation. This approximation can be of low temperature type (restricting 
to the most important contributions, e.g. saddle point approximation) or high 
temperature type (starting from the mean, e.g., cumulant or moment expansion 
relying on an expansion of the exponential) or a Monte Carlo integration (ran- 
dom sampling). From the fact that the norm of a gaussian does not depend on 
its mean it follows that changing templates does not change the normalization 
constant as long as the covariance K~ l is unaltered. Hence, in cases with 9— 
independent K also the partition sum Z(9) and thus c{9) is ^-independent. For 
^-dependent K the replica method or super symmetric approaches (see Appendix 



A.4 ) can be useful in some cases. 

Two important situations have to be mentioned where continuously param- 
eterized or adaptive templates appear naturally. 

1. (Approximate structural models or combining and extending arbitrary learn- 



ing methods) We have already discussed in 2J2 that a template can be given 
by an arbitrary parameterized function, for example regression models like 
decision trees or neural networks |^7|p| ,^,^| . Then 9 denotes the parame- 
ter vector of such a model and the integral Jd9 is a Bayesian integral over 
the possible parameter values with prior probability p(9). The usual way to 
proceed is to use a learning algorithm to find an optimal approximation 9*. 
This corresponds to a saddle point approximation of the ^-integral. Including 
such templates t(9) in the regularization functional means that several differ- 
ent approximation methods can be combined and restrictions given by their 



32 



Lemm, J. C. / How to Implement A Priori Information 



parameterization can be overcome. Using a combined saddle point approxi- 
mation for h and 9 leads to stationarity equations which couple the optimal 
h* and the optimal 9*. Such a simultaneous saddle point approximation for h 
and 9 uses the same training data D{ to determine both, h* and 9*. (This can 
be compared with boosting methods which have received much attention re- 
cently J6l|.) Using parameterized function spaces t(x, 9) as adaptive templates 
corresponds to the prior assumption that the structure of the data generating 
process is approximately captured by their parameterization. For example, 
t(x, 9) can model a probable hierarchical organization of the generating pro- 
cess. 

2. (Approximate continuous symmetries) Typically, templates t{x) can appear 
in several variations t(x, 9). These variations may be described by a continuous 
parameter vector 9. For example, one may wish to include translated, rotated, 
scaled or otherwise deformed 'eye'-templates in the regularization functional. 
Again, a combined saddle point approximation can be used to find the best 
fitting variant of the template t* = t(9*) and an optimal approximation h* 
simultaneously. 

Consider now a general case of input, output, and generation noise. Fur- 
thermore let i = (ixiiy^h) be a discrete mixture variable for which we want to 
treat the summation exactly, and 9 = (9 x ,9 y ,9f l ) a continuous mixture variable 
to be treated in maximum a posteriori approximation. It follows from Eq.(71) 
that 



p(h\D) oc Jd9p(i h )p(h\i h )p(y\i y ,9y)p(i x ,9 x \x)Y[p(iy l ,9y l \i Xl ,9 Xl ,h). (105) 

i I 

Especially for quadratic concepts, 

p(h\D)(x^2 / d9p(i h ,9 h )p(h\K ihfih ,ti ht e h )p(T\i Xl 9 x )p(iy,9y\K) 



I 

To obtain an error functional for minimization we write this in terms of energies 
and free energies 

i 

xe -Py(E(y\i y ,8 y )-F(y\i y ,8 y )) e -f3 x (E(i x ,8 x \x)-F{I x ,e x \x)) 

x e ' p L E; { E ^vi > e vi K«i M ,h)-F(ii,Q Vl \i H A, .ft)) ) ( 107) 
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where 9 X G @ x , 9 y £ 6 y . In particular, for quadratic concepts, choosing (3 p = (3 L 
= (3 and writing the generation probability p(h\Kk,t m ) in likelihood form under 
uniform prior 

p(h\D) oc ]T I^K^e^Sr^ 1 ^^^^)^^^)^^^,^!^^)^))^ (10g) 



with t iyifiyi = t u (9 Xl ), K ixifixi = Ku(e yi ), and 

P{ih, 0h)p{T\i y , O y )p(i x , 9 x \x) 



J2i y Jde y p(T\i y ,6 y 



(109) 



Hence, instead of maximizing the posterior probability p{h\D) we can minimize 
an error functional 



E 



In 



/ d6 Pifi< 



-/3Ei(e)+ Ci (e,h) 



(110) 



For h-, 9-, i-independent normalization factor Z(Tn t g \Kn(9 Xl ), h) also 

d(9, h) = - In Z(Tu fivi \K u (9 Xl ), h) + c, (111) 

with arbitrary constant c, is 9~, i- or /i-independent and can thus be skipped. 
This is the case for gaussians with 9-, i-independent covariances K~ x / (3. 
Consider for example component energies E{ 

Ni d 2 - 
2 



Ei{9) — Ei^i + E it2 (9), E,^i — Y E ij2 (9) 

j 



(112) 



where we separated an ^-independent part, with squared distances 



4 



h - U 



Ki 



h-Uj), d 2 {9) = (h-t i {9)\K i \h-t i {9)). (113) 

Performing for differentiable Ei(9) the integral in saddle point approxima- 
tion and assuming p^g to be, compared to Ei{9), slowly varying at the stationary 
point, e.g. being uniform, (otherwise the derivative of p i: g has to be included) 
results in the stationarity equation 



d9 ^ 



d9 



with 



dE it2 (e) _ /du(9) 



89 
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K, 



U{9) - h 



(114) 



(115) 



This equation has to be solved simultaneously with the stationarity equation 
for h. Thus, to find the optimal h* two (sets of) coupled stationarity equations 
have to be solved. As those are usually nonlinear, a self-consistent solution has to 
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be found by iteration, starting from some initial guesses h° and 9°. The iteration 
can be performed, for example, according to the following steps: 

1. Obtaining h°: Restrict the problem first to the training space Xt, i.e., the 
space of Xk which are present in the training data Dt- For equally weighted, 
standard mean-square data terms a natural initial guess for h(xk) in that sub- 
space is the observed average of y k for the given x k , i.e., h (x k ) = — Y^ x Ui( x i) 
= txixk) with n x the number of training data for x. 

2. Obtaining 9 1 : In the second step h° is inserted into Eq. (|ll5| ), projected to 
the training subspace Xt, and iterated with initial guess 6° to obtain a new 
l optimal in Xt for that hP. 

3. Obtain full h: The intermediate solution 9 1 is then used to solve for h(x) for 
all x E Xr. 

4. Continue iteration: The stationarity equations are solved by iteration until 
self-consistency is reached. 

Consider the special case of choosing K proportional to the projector Pt into 
the space Xt of training data and using as initial guess for h the data template 
(see Eq.(^) ) h°(x k ) = tT(xk)- This is equivalent to a standard mean-square 
error minimization 

= J2^§^-(ti(x k J)-y(x k )). (116) 

k 

For example, the template ti(x,9) can represent a neural network with weights 
and biases included in 6. In that case Eq.( |116|) could be solved by backpropa- 
gation. It is important to note that in our context the resulting network ti(9*), 
optimal on the training data, is only the initial guess to be used for further itera- 
tion to obtain an optimal h* . In later stages of the iteration ti{9) can for example 
be retrained by virtual examples drawn from h{x). 

In the case of continuous symmetry transformations Si{9) the transformed 
template is given by 

UW) = Si(9)t t = e 9s %, (117) 

where 9 is the parameter vector of a continuous Lie group with vector of gen- 
erators s (see 2.3). This yields the derivative dti(9)/d9 = SiU(9) = SiSi(9)ti 
hence 

dE% QQ 6) = (siU{9) \k { \ U(9) -h) = (Si(9) Si U \Ki\h- Si(9)u) (118) 
for SiSi = SiSi. Using S~ 1 (9) = S(—9), this can also be written as 

' Si ti\S? (9)K i S i (9) (U - Si(-6)hj) . (119) 
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Adaptive 
template 




Figu re 6. The figure shows one of the templates depending on two parameters used for model 
(121). Parameter 8i describes the location of the first maximum, 82 the distance between 

maximum and minimum. 

For translations, for example, one finds with S{9)t{x) = t(x + 9) and S T {9) = 
S^ 1 {9) = S{—9) for vanishing commutator [Ki, s] = K^s — sKi = 

9))K(x,x')^. (120) 



st{6) lid t(9) ~h)= I dxdx' (t(x f ) - h{x 



Example 6 (Adaptive templates) Figure || shows a simple example of an adap- 
tive template depending on two parameters. In Figure numerical results are 
presented for a corresponding one-dimensional two template model 

E(0, 9') = - In (e-^W + g -/W)) , (m ) 

with mean-square data terms and the negative laplacian —A as concept operator 
K, i.e., 

^ ,^ lv^/ , / ^2 1 f , fd(h(x) - t 1 (x,9 1 ,9 2 ))\ 2 

£1 W = 2 E(f* " K*k)) 2 + 2j dx ( dx J ' ( } 

k 

1^, „, 1 /" , fd(h(x) - t 2 (x,9\,9'n))\ 2 , , 

E 2 (9') = -Y,(yk-H^)) 2 + ^Jdx [ { { 1 2 y 11 2jJ j . (123) 

The initial guesses 0°, 0'° have been obtained by projection into the space 
Xt according to the method discussed above. Then h and 9 have been updated 
alternately using complete search for 9 and 9' and an expectation-maximization 
(EM) algorithm for h. 



4. Learning 



The stationarity equations of the presented models are in general nonlin- 
ear, inhomogeneous integro-differential equations. One may remark, that similar 
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Figure 7. The figure presents numerical results for the model (121) with two adaptive templates 
£i(#i,#2) (shown in Figj^ and another £2(^1, #2) on a mesh with 20 points. The figures in the 
first row show 1. ten data points (dots), the true function h (dashes) from which the data have 
been sampled with gaussian distribution (with a = 0.5) and ti (thick) with 9i, 62 optimized 
for the given data, 2. the same with t2, and 3. the linearly transformed templates Wti and 
Wh. Hereby W is the derivative operator which is a concept filter for the negative laplacian. 
The second row shows for a high temperature case (here j3 = 0.1) 1. the solution h during 
iteration for initial guess h° — t\ (thick), and the high temperature limit (thin dashes). The 
high temperature limit is the template average of data and both adaptive templates, and is the 
limiting case where the OR-like mixture becomes a gaussian AND. 2. The same for initial guess 
h° — t2- 3. The final solutions for both cases and the classical solution for only one laplacian 
prior concept with zero template to = (thick dashes). Notice, that in the high temperature 
case the two solutions coincide. The third row shows the same for a low temperature case (here 
(3 — 0.5) Notice that here the two solutions evolving from the two initial guesses h° = t\ and h° 

— t2 do not coincide. 

equations appear for example in quantum mechanical scattering theory, where, 
similarly to templates or data, the inhomogeneities represent measurable asymp- 



totic states ("channels") of the system [41]. Nonlinear equations have to be solved 



by iteration ]19|,PJ7|. Consider the equation to be solved written in a form 

K(h)h = t(h). (124) 

Then an iteration procedure or learning algorithm is obtained by selecting an 
operator A, usually positive definite, and a relaxation factor rj, to be used with 
the updating rule 

h k+i = h k + v A- l (t-Kh k ). (125) 
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For T) small enough and a positive definite A the function to be minimized de- 
creases till reaching a local minimum. A may depend on the iteration step k 
and h k . The gradient algorithm, for example, is obtained when taking A = I 
equal to the identity. It does require matrix multiplication but no inversion. A 
gaussian A^ 1 on the other hand can approximate local correlations induced by 
differential operators. Choosing A = Km corresponds for error functional Em 
to the expectation-maximization (EM) algorithm and Newton's method takes 
the negative Hessian. Figure ^ compares for the mixture model (]89| ) gradient 
algorithm, an EM algorithm (labelled relaxation), and iteration with gaussian 
A^ 1 with the two different standard deviations a = 2, a = 1. It shows that 
the gradient has extreme difficulties with long range correlations. The gaussian 
A^ 1 performs well, at least at the beginning of the iteration. That means it 
captures well the covariance structure of that particular problem. This is useful, 
because in this case A -1 is given and so no inversion is needed. It can be seen 
that the gaussian algorithm especially at larger variance takes longer to adapt 
the fine structure of the function. This suggests to change the variance during 
iteration or to combine it with the gradient algorithm. The EM algorithm, which 
works here quite well, requires at every step inversion of the /i-dependent Km- 
The performance of specific algorithms is clearly problem dependent. Recently, 
multiscale or multigrid methods have been become particularly popular. 

5. Conclusion 

A new and relatively general method has been proposed to construct prob- 
lem specific error functionals (or posterior densities) utilizing complex, 'informa- 
tive' a priori knowledge. For that purpose a priori information is decomposed 
into simple components representing constraints, like measured training data or 
approximate symmetries, which the function to be approximated is expected to 
fulfil. The constraints are represented by quadratic error terms which are com- 
bined using logical operations (conjunctions and disjunctions). Conjunctions of 
quadratic concepts lead to classical quadratic regularization functionals or, in 
Bayesian interpretation, to gaussian processes. Disjunctions, representing sit- 
uations with ambiguous data or ambiguous priors, result in nonconvex models 
going beyond classical regularization and gaussian process approaches. Two vari- 
ants to treat ambiguous a priori informations have been discussed in more de- 
tail: mixture models for posterior densities and polynomial models related to the 
Landau-Ginzburg treatment of phase transitions. For simple numerical examples 
the feasibility of the approach has been demonstrated. 

The presented approach might especially be useful for complex learning 
tasks with relatively few available training data. It seems also worth to study 
its relations to knowledge transfer and combination of learning systems in more 
detail. 



38 



Lemm, J. C. / How to Implement A Priori Information 



Relaxation 





1.5 



Gaussian Gradient (a = 2) 



Gaussian Gradient (o = 1) 





Figure 8. Comparison of learning algorithms for mixture model (g9p . Shown are the first 
iterations (starting with template tb, see FigJEJ) for the four iteration matrices A = Km (EM for 
mixture model (|S9|), labelled relaxation) A — I (gradient) and two gaussian A -1 with standard 
deviations a = 2, a = 1. The gradient has obviously difficulties capturing the long distance 
correlations and requires an extremely small relaxation factor (step width) rj. The gaussian 
algorithm, which like the gradient requires no inversion, performs relatively well on a global 
scale without reaching the fine structure as fast. 



Appendix 



A. Statistics and Statistical Mechanics 



Both disciplines, statistics and statistical mechanics, deal with probabilis- 
tic models. Their differences in language and methods can be traced back to 
differences in their typical applications. 

Statistical mechanics has been developed for extremely large systems, like 
they appear in condensed matter physics. Typical systems of statistical me- 
chanics are of high regularity, defined on a two or three dimensional grid with 
local variables having the same range of possible values (e.g. ±1 for spins or real 
numbers for scalar fields), and only local interactions. There are relatively few 
prototypical systems, which have been studied extensively, for example, the cel- 
ebrated Ising model. A complex machinery has been developed to obtain results 
with very high accuracy but requiring long and costly calculations. 

Statistics, on the other hand, is mainly interested in the solution of a larger 
variety and more application oriented problems. Compared to prototypical sys- 
tems studied in statistical mechanics, the corresponding models are therefore 
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often smaller but less regular. The main practical problem consists in construct- 
ing adequate models. The models are usually not expected to allow very precise 
predictions. Hence, their is no need to achieve, by costly calculations, a numerical 
accuracy which is beyond the validity of the model. Needed are fast, flexible, and 
robust algorithms. 

However, due to the increasing computing power becoming widely available, 
the gap between the two disciplines gets smaller. Their is now a growing and 
seminal interaction between the two disciplines, like in the development of Monte 
Carlo methods or graphical models. Typical areas where methods of statistical 
mechanics can be applied easiest have a large amount of quantitative data of the 
same kind, organized on a one, two, or three dimensional grid with dependencies 
dominated by local interactions. This, for example, is the case in Bayesian image 
reconstruction or when predicting financial time series. As it becomes therefore 
important to understand the languages of both approaches we will discuss in the 
following the relations between concepts of statistics and statistical mechanics. 

We begin with some remarks concerning the construction of probability 
distributions: 

1. (Normalization constants or partition sums) The specification of a probability 
p(x) starts with unnormalized numbers Z(x). To ensure the normalization 
condition Jdxp(x) = 1 a normalization constant Z = JdxZ(x) has to be 
calculated. For simple systems like a dice this is easy. For large systems, like 
in typical systems studied in statistical mechanics, this can be a highly non- 
trivial task. Thus, unnormalized functions Z{x) instead of probabilities p(x) 
are the natural starting point for large systems. Despite the fact that the Z 
are not probabilities because not normalized we will call them in the following 
'unnormalized probabilities ' or partition sums. 

2. (Log-probabilities, information and energy) A large system is usually con- 
structed out of simpler subsystems, with the probabilities of the subsystems 
combined by multiplication, according to p(x\,X2) = p(xi)p(x2\xi). Sums, 
however, are easier to handle than products and represent a somewhat more 
intuitive concept. So it is easier to deal with infinite sums, or in the continuous 
case, with integrals, than with infinite or continuous products. This means, 
that large systems are easier constructed in terms of logarithms of probability 
lnp(x). We will show in the following how information is related to expec- 
tations of the logarithmus of probabilities and energy to expectations of the 
logarithm of unnormalized probabilities Z. 

3. (Conditional probabilities and disordered systems) Instead of directly con- 
structing a complicated probability distribution p{x) it is often easier to break 
p(x) down in simpler parts. This is done by selecting conditions y under which 
p(x\y) and p(y) are relatively easy to specify. The total probability is then ob- 
tained by combining the alternatives y according to p(x) = fdy p(y)p(x\y). For 
example, it can be more easily to specify probabilities of symptoms for given 
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specific diseases and probabilities of diseases than to write down directly a 
probability for the symptoms averaged over all diseases in one step. This, how- 
ever, also means that due to the normalization requirements Jdx P(x\y) = 1, 
a normalization constant Z(y) over x must be calculated for every y. In sta- 
tistical mechanics such models are called disordered systems. Such averages 
over energy functions occur for example for spin glasses p^,|l8|,|5l|] . 

Thus, while statistics can often be formulated directly in terms of probabili- 
ties, statistical mechanics uses a formulation in terms of logarithms of unnormal- 
ized probabilities. In the following their relations to the concepts of energy and 
free energy will be discussed. 

A. I. Probability 

Especially for large and complex systems, it is convenient to work with the 
logarithm of 'unnormalized probabilities' instead directly with probabilities. 

A. 1.1. Normalization factors or partition sums 

Let X be a random variable with possible values x G X and assume a prob- 
ability measure p(A) defined on a c-algebra of possible events A being subsets 
of a set X . The event x may be represented by a vector of real numbers. 

Denoting unnormalized probabilities by Z(x) oc p(x) we write 

PM = H (126) 
and for general events A with A C X, i.e., including A = X 

Z(A) = [ dx Z(x) oc p(A) = I dxp(x), p{A) = (127) 
JxeA JxeA 

Introducing a second random variable Y and using the compatible normalization 



z (y) = z i x i v) = J dx z ( x > y) « p(y) = J dxp(x, y)) (128) 

gives 

Z{X, y) = Jdx Jdy z(x, y) = J dy z{y) = Z(y) (129) 

and therefore 

v(v) _ Z (y) _ Z (V) and v(x\v)- P{x,y) - Z{X,V) (nQ) 

Piy) - z(y)- z(x,y) and ^)" p(y) ~ z{y) ■ ( 13 °) 

Choosing Z(y) to be not equal but proportional to Z(X,y) one obtaines 

v(x \ v ) - - z ( x >y) z (y) (m) 

P{ ly) ~ P (y) - z (y) z (x,yy (131) 
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In slight generalization of the language of statistical physics we call unnormalized 
probabilities Z also partition sums. 

A. 1.2. Log-probabilities, bit numbers, and free energies 
Log-probabilities L are defined by 

L(x) = lnp(x), p(x) = e L(x) (132) 

and for A C X 

L(A) = lnp(A), p{A) = e L( - A \ (133) 
In terms of log-probabilities L a product like 

p(x,y) =p{x)p(y\x) (134) 

becomes a sum 

e L(x,y) _ e L(x)+L(x\y) _ (135) 

Log-probabilities are widely used in practice due to the fact that it is often more 
convenient to deal with sums than with products. Also the 'quenching' effect of 
the logarithm can be important in numerical calculations when p(x) varies over 
several orders of magnitudes. 

Common are especially negative log-probabilities b also called bit numbers 

b(x) = -hxp(x), p(x) = e~ b[x) (136) 

and for A C X 

6(A) = -In p(A), p{A) = e- bi - A) . (137) 
Analogously the free energy is defined for unnormalized probabilities Z by 

F(x) = ~ In Z(x), Z{x) = Z(x\/3) = e~ pF{ - x \ p(x) = e-W*)-^). 

(138) 

and for A C X 

F(A) = ~\nZ(A), Z(A) = Z(A\P) = e-W A \ p{A) = e -^)-^W). 

(139) 

The dependency on (3 will in the following not always be denoted explicitly. For 
x G X the factor e-W is also known as Boltzmann weight of x. The reasons 
for the introduction of the parameter will be discussed later in detail. 

A. 2. Random variables 

Information and energy are averages of special random variables. We discuss 
now their connection to bit numbers and free energy. 
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A. 2.1. Averages 

Recall the definition of an expectation or average of a random function g(x) 
over X 

9x = (g(x)) x = J dxp(x)g(x). (140) 
Analogously, we define an expectation over a subset A C X 

g x {A) = (g(x,A)) A = J dxp(x\A)g(x,A) = J^dx ^g(x,A) (141) 

using p(x, A) = p{x) and p(x\A) = p(x)/p(A) for x £ A. Using a second random 
variable Y this can be generalized to a conditional expectation of a random 
function g(x,y) over event Y = y 

9xiy) = (g(x,y)) xiy = J dxp(x\y)g(x,y). (142) 

An average of form ( |141| ) is obtained by choosing in ( |142| ) a random variable Y 
taking the value y everywhere on A but not on its complement. 



A. 2. 2. Information 

A random variable C on X corresponds to a function C{x) defined for every 
x 6 X. A special example is a transformation T of p{x) which defines a cor- 
responding random variable by C(x) = T(p(x)) for x £ X. We will call in the 
following C the canonical random variable of the transformation T on X. Specif- 
ically, for every distribution p{x) the corresponding bit number b{x) = — \np{x) 
can be considered as random function with the property of being defined not only 
x— but p ( x ) -dependent. The canonical random variable on X for bit numbers, i.e., 
for the transformation — lnp(x), will be called information I(x) = b(x) = —L(x). 
(For an axiomatic approach, properties of information and the definition of re- 
lated quantities see for example P,^Jl5[|.) Accordingly, Eqs.( |136| ) can be written 

I(x) = -lnp(x), p{x) = e~ I{x) . (143) 

Like any random variable the bit number or information b(x) can be averaged 
over the whole set X, over a subset A C X, or conditioned on Y = y. One finds 
for the average or first moment of the bit number or information b the average 
information Ix 

I x (X) = (b(x)) x =-(lnp(x)) x (144) 
I x (A) = (b(x)) A =-(lnp(x)) A (145) 
l x (y) = (b(x, y)) x{y = - (\np(x, y)) ^ (146) 



In Eq.( |l45|) we used b(x, A) = b(x) for x £ A and p(x\A) lnp(x, A) = OlnO = for 



x ^ A and in Eq. (|14q ) we allowed y-dependent p(x,y). It follows in accordance 
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with the definition from Eq.( 145fl for A = {x} that Ix(A = {x}) = Ix{x) = 
I(x) = b(x) or analogously Ix(Y = X = x) = b(x,x) = b(x) from Eq. ( |146| ) . 
While / and b coincide on events x G X, in general the difference between bit 
number and average information, i.e., the difference between transformed proba- 
bility and expectations of the corresponding canonical random variable, is given 
by ' 

ix(y) - b(y) = (K x , v)) x[y - b (y) = - Qnpfa y)) x]y + fop(y) 

= -( lnPj pW) xiy = - ^lv)>*„ = M x \v)) xi , ■ ^ 

Defining the entropy (conditional information) 

H x (y) = (b(x\y)) x]y (148) 

this can be written 

Ix(y)-b(y) = H x (y). (149) 

including 

I x (A)-b(A) = H x (A). (150) 
The relations b(X) = and Ix(%) = b(x) yield the special cases 

H X {X) = I X (X), H x (x) = 0. (151) 



A. 2.3. Energy 

In the same way as b{x) also F(x) may be interpreted as random variable 
on X called energy E(x) = F(x). (E is also called (euclidian) action in field 
theory). Hence, energy is the canonical random variable of the transformation 
— jj\n(Z(X)p(x)), and one can write for Eqs.( |138) ) 

E(x) = -~]nZ(x), Z(x) = Z(x\P) = e~^ x \ p(x) = e -»)-W). 


(152) 

In general (3F can be decomposed into f3F = J2i a iEi- (This is the case, for 
example, in the grand canonical ensemble of statistical physics where, compared 
to the canonical ensemble, a component corresponding to the particle number is 
added.) 

The analogue of average informations I are then averages of the energy 
E(x) = F(x) called average energies Ex 

E X (X) = (F(x)) x = -1 (lnZ(x)) x (153) 
E X (A) = (F(x)) A = -- (In Z(x)) A , (154) 
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E x {y) = (F(x,y)) x]y =~ (In Z(x, y)) ^ . (155) 



It follows in accordance with the definition from Eq.(154) for A = {x} that 
E x (A = {x}) = E x (x) = E(x) = F(x) or analogously E X (Y = X = x) = 
F(x,x) = F(x) from Eq.( |l55[) . We will skip the subscripts X in the following. 
While free energy and energy coincide on events x S X in general their difference 
is given by 

(3E(y) - f3F(y) = - (In Z(x, y)) ^ + In Z{y) 



Choosing 



i.e., 



it follows 



Therefore 



F(y) = -i In Z(y) = -~ In fdx e'^^ (157) 

Z{y) = Z(X, y)= Jdx Z(x, y) (158) 
Z(y) = fdy Z{y) = jdx jdy Z(x, y) = Z(X, y). (159) 



and 



- (lnZ(x,y)) xiy + lnZ(y) = - (]n P (x\y)) ^ (160) 

(F(x, y)) x]y - F(y) = ± (b(x\y)) ^ . (161) 

Assuming F(x\y) here to be defined as 

1 Z(x y) 1 e -P F <y x >y) l e -PF(z,v) 



i.e., 



Z(x\y) = ^^-=p(x\y), (163) 



one can also write for Eq.(161) 



(F(x,y)) - F(y) = (F(x\y)) (164) 

\y *\y 



parallelizing the corresponding equation( |147j ) for bit numbers or informations 
b. Using the definitions of entropy H{y) = (b(x\y)) and energy E(y) = 
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(F(x } y)) x Eq.( |161| ) states that the /3-scaled difference between energy and free 
energy is the entropy 

0E(y) - (3F{y) = H{y) 

This is a generalization of the well known relation of statistical physics 

/3E(X) - f3F(X) = H(X), 

where the argument X is usually skipped. 

The following table summarizes some of the relations (recall that the variable 
y in the table can be replaced by x £ X or A C X and note that for Z = (5 = 1 
free energy and energy become identical to bit number and information): 



Transformed 
probability 


Averages of canonical 
random variable 


Difference 


bit number 

Ky) = ~ ln p{y) 


information 
I(y) = - (lnp(x)) xiv 


entropy 
H(y) = - (\np(x\y)) xiy 

= I(y) - Ky) 


free energy 
F(y) = ~hxZ(y) 


energy 
E(y)=-± (In Z(x)) xly 


entropy 
H(y) = - (\np(x\y)) xiy 

= P(E(y) - F{y)) 



A. 3. Temperature and external fieldt 



Now we look at the role of the parameter f3. Its inverse T = 1/(3 is called 
temperature in statistical mechanics. The parameter (3 can also be interpreted 
as an external source or field coupling to the conjugated random variable energy. 
The energy can thereby be subdivided into several components Ei with corre- 
sponding conjugated One Pi, for example, can be proportional to a magnetic 
field (or chemical potential, pressure, • • •) coupling to a magnetic moment Ei = M 
(or particle number, volume, • • •). Also the calculation of moments or cumulants 
of other random variables is often facilitated by introducing an external source 
coupling to that variable. 

We will discuss the following roles of 0: 

1. (3 is Lagrange parameter determining the expectation of the energy (E) = 
Jdxp(x)E(x). Its variation defines an exponential family. 

2. (3 is a homotopy parameter used by annealing methods, interpolating between 
easy and difficult to solve problems. 

3. (3 represents an external source or field coupling to the energy. The cumu- 
lants of E can be obtained as responses to a changing external field, i.e., as 
derivatives of InZ with respect to (3. For example, (E) = —(d/df3)lnZ(X). 



165 



(166) 
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A.3.1. Maximum entropy and Boltzmann-Gibbs distributions 

It is well known that minimizing the entropy and fixing normalization con- 
dition Jdxp(x) = 1 and expectations Jdxp(x)Ei(x) = Ei(X) by the Lagrange 
multiplier method yield Boltzmann-Gibbs (or generalized canonical) distribu- 
tions. Indeed, adding the constraints with Lagrange multipliers on and setting to 
zero the functional derivative of 

ff(!/)-X;«^i(l/)-«o<l>^ w (167) 

with respect to p{x\y) 

^H(y)-J2^Ei(y)-a (l) x]y 
■hip(x\y)-l-^2cxiEi(x,y)-a Q , (170) 



° = 5pJ^\yj ' H[V) " ^ aiEM " °° (1) ' (169) 



i=i 



one finds 



p{x\y) = e -E j=1 ^(^)-o-i = ^rr-, (171) 

with 

J2<XiEi(x,y) = PE(x,y) = [3F(x\y) and Z(*|y) = e a ° +1 . (172) 
i=l 

For p(x\y) = p{x) this gives Eq.( |152| ). Thus any probability distribution 
p{x) can be seen as result of a maximum entropy procedure with normalization 
constraint and fixed expectation of the energy E{x). 

A. 3. 2. Annealing methods 

The Lagrange multiplier (3, respectively the temperature T = 1/(3, deter- 
mines the average energy. Introduction of several Lagrange multipliers allows 
the fixation of several components E^ of E(x). Varying (3 defines an exponential 
family with canonical parameter (3 and and canonical statistic E. In the high 
temperature limit T —* oo, i.e., (3 — > 0, all p{x) become equal. In the low tem- 
perature limit T —* 0, i.e., [3 — > oo, only events x* with maximal p(x*) > p(x), 
\/x £ X survive, while all other events x with p(x) < p(x*), 3x* £ X are damped 
exponentially with decreasing temperature. The temperature dependency is used 
by annealing methods which are specific realizations of general homotopy or pa- 
rameter continuation methods and very important in practice [34,^4 ,|60, 73]. They 



solve a difficult (e.g. multimodal) problem at finite or zero temperature by be- 
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ginning with an easier (e.g. convex) high temperature problem and then slowly 
decrease the temperature. 

A. 3. 3. Generating functions 

Moments or cumulants of random variables can often be conveniently cal- 
culated by the use of generating functions. The nth moment M n of a random 
function g(x), with E(x) being a special case, is the expectation of its nth power 

M n (g) = (g n (x)) x , e.g. M n (E) = (E n (x)) x . (173) 

For a vector valued function g with components gi (e.g. Ei), i E 2 the moments 
become the functions (unconnected correlation functions) 

M h,i2,-,in = (9h{x)g i2 (x) ■ ■ ■ g in (x)) . (174) 



The cumulants (or connected correlation functions) are given by [20, 46| 

CW-,*« =E(- 1 ) m ~V-l)! M Jxj»,~J n M Kto,--,k n '--M h j 2 ,...j tm (175) 
v 

with inverse 

Msi,i 2 ,-,in = X/ Cji,j 2 , -,j Pl CkxM,-,k P2 " " " Ch,l2,-,l Pm (176) 
V 

where V denotes a partition of the n indizes into non-empty subsets and m is the 
number of factors in the summand and one takes Cq = 0. For a small number m 
of components i moments and cumulants may be more conveniently indexed by 
"occupation numbers" nj 

M (m , n2 ,..,n m ) = (gT^gTix) ■ ■ • (177) 

For scalar function 5, i.e., in the one component case I = {1}, we will write the 
nth moment Mx t \ t ...^ = Mr n \ = M n and nth cumulant Ci^...^ = Cr n \ = C n , 
skipping the bracket for the sake of simplicity. Hence, for a scalar g 

M = l, Mx = C x , M 2 = C 2 + (d) 2 , M 3 = C 3 + 3C 2 Cx + (Cx) 3 , (178) 

C = 0, Cx = Mx, C 2 = M 2 -(Mx) 2 , C3 = M 3 — 3M 2 Mx + 2(Mi) 3 , (179) 

where the second cumulant is the well known variance. Unlike moments, the cu- 
mulants are additive for independent subsystems, i.e., p(xx,x 2 ) = p(xx)p(x 2 ) =>■ 
C n (p(xi, x 2 )) = C n (p(xx)) + C n (p(x 2 )). Another significant property of cumu- 
lants is the possibility to set consistently C n = for all n > 2 (for gaussian 
distributions), which is not possible for moments. If, however, one C n 7^ for 
n > 2 then automatically an infinite number of other C m do also not vanish (See 
for example [pTjl ). For a multidimensional gaussian distribution with vanishing 
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means Mi = Eq.fll76j) reduces to a sum over two-pomt connected correlation 
functions (or propagators) 



Mi lt i 2 ,....i 2n — E ^hM^hM ' ' ' Cj n ,k„ 

Pairings 



(180) 



This relation is known as Wick's theorem. 

For a scalar function g(x) the moment generating function is given by 

M( 7 ) = ^M = (e 19 ^) = Jdxp(x)e^ x) 

00 „,n 



with 



1 c < - K -> n 

- dxe~^ + ^ = yl T M, t 



Ziri) = dxe-? E ^ + ^ x \ 



(181) 



(182) 



For vector g (with discrete or continuous index set X one has 7 g = YU 7*9* ( or 
/ di for continuous i) and 7 n M„ has to be understood as {(J2ili9i( x )) n ) x i i- e -> 

M( 7 ) = ( e sr^w) = e-( E^(x) ) 

n=0 \ V i / I v 



00 n 



m ni _ n 

E E 71 7n 

n=0ni,...,Bm i 



ni!n 2 ! • • • n m i 



I^"(ni,n2,---,n m )) 



E E Til ■ • ■lin M h,i2,-,in- 
n=0 ' h,—,i n 



It is easy to verify that the moments can be obtained by 



QU 

ay 



■M( 7 ) 



7 =o <9 7 r 



7=0 



(183) 



gn 
Iff' 



7=0 



<? n (z)e 79(:r) 



(g n (x))=M n (g). (184) 



7=0 



or in the multidimensional case 



QTl 

dj h ■ ■ ■ dji, 



-M( 7 ) 



<9 n 

7=0 <9 7il • • • <9 7i; 



M ilii2 ,... iin ( ff ). (185) 



7=0 
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The cumulants are generated by differentiating the logarithm lnM(7) = C(^) 



Qn 



din ■ ■ ■ d % 



(97™ 
-C7( 7 ) 



C{i) 



= |_l n / e 7 S W 
7=0 07™ \ ' v 



C n {g) 



(186) 



7=0 



In 



7=0 dj h ■ ■ ■ d% 



ln/e£i™i(z) 



In 



«l,l2,---,ln 



7=0 



(187) 

Hence, C(7) is the cumulant generating function with Taylor expansion around 
7 = 



(188) 



C(7)=lnM(7) = ln(e^)> =£J-C n ( 5 ), 

n=0 

or in the multidimensional case 

00 -I 

C( 7 )=ln( e £^)) =Erj E 7n---7 l „C n , 2 ,.,, n . (189) 



* — ■ n! . 

n=U n,...,i n 



Analogous to [3 the parameter 7 can be thought as an external source or field 
(e.g., a magnetic field) coupling to g. Even if a field 7 is not present in 'reality' 
the cumulants can still be calculated as derivatives at 7 = 0. If a nonzero field 70 
is present we can incorporate 705 in a new energy — j3E = —fi'E'+^Qg replacing a 
given [3'E' and proceed as before. Because sometimes useful, especially to obtain 
the Legendre transform of C, we will give in the following some of the formulae 
explicitly for nonzero field 70. Assuming that derivatives and integration can be 
interchanged one has 



-M( 7 ) 



7=70 



Qn 



9 n (x)e 



7=70 



Including the 70 -term in the expectation (• • •) we find for 



Z( 7o ) M( 7o ) 



xh 



(190) 
(191) 



as derivatives 

Qn 



d n Z{i) 



7=70 dj n Z(j ) 



7=70 



"0(7-70)9 \ 



9 e 



7=70 



8 n ( Jdx e -PE(x)+ 7 g(x) ■ 
Iff 1 Jdx e -PE{x)+ 10 g(x) 



(9 n (x)) xho =M nr/0 (g). 



7=70 



(192) 
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Cumulants for nonzero fields are generated by C 7o = In M, 



x 70 



gn 



7=71 



an 

In ( e (7-7o)s(:r) 



C ni71 (,7), (193) 



7=71 



where 71 = 70 is possible. Because an additive constant does not change the 
derivatives of a cumulant generating function also C (7) = In M(j) = In M 70 (7) + 
lnZ(7o) — lnZ( 7 ) = lnM 7o (7) + lnM(7o) can be used as cumulant generating 
function for 70 / 0. The expansion of the generating functions M 70 and C 70 
around 70 in powers of (7 — 70) becomes 



- /„(7-7o)9(x) 



*|7 n=0 U - 



(194) 



C 7o ( 7 ) = lnM( 7 ) = In ( e (7-7o )<?(*) 



X\l 



~ ( 7 _ 70 )« ~ ( 7 _ 71 )« 

n=0 n ' n=0 



n! 



(195) 



and analogously for the multidimensional case. Moments and cumulants for dif- 
ferent fields 70 and 71 are related according to 



M, 



z( 7l ) d n 



™,70 



9 e 



(70-71)9^ 



*|71 



00 (70-71)'- m 



Eoo 
m=0 



m+n, 71 



7=70 (e(70-7l)9) 



A-| 71 



v-oo ( 70 -7i)fc , , 
2^k=0 k\ iW fc,7l 



and 



C, 



™,70 



Qn 



■C 71 ( 7 ) 



7=70 



y (7Q-7i) m r 

/ j ^,1 tvm+n, 71) 



m=0 



m! 



i.e., for the difference 



°2, (70 - 7 X ) m 
^C n (7o,7i) = C ni7o — C„, 7l = — C m + W)71 . 



mi 



(196) 
(197) 

(198) 



Because inverse temperature f3 can be regarded as a special nonzero field, 
the moments and cumulants of the energy can be obtained by 

<>U -Z = ZM n (E) (199) 



d(-P) 



-\nZ = C n (E) 



(200) 
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like 

d 2 



<){ ^y^Z = C 2 {E) = {E\x)) x - (E(x)) 2 x . (201) 

The generalization to the multidimensional case is analogous to Eqs.( [183| ), ( |185| ). 
Equations for the second cumulants connecting the derivative with respect to an 
external field (response, dissipation) with a variance (fluctuation) are also known 
under the name dissipation-fluctuation theorems. 

Moments of a function g of x can be expressed by moments of Ax = x — xq 
by expanding g(x) around xq 

OO / \n 077, 

o(x) = £^^^o(x) =e^')g{x') , =e^g(x ), (202) 



n\ dx r ' 

n=0 



with gradient V'g(x') = (d/dx')g(x') = g^(x') and analogously (V')™g(x') = 

) . We understand here and in the following the expres- 
sion e^ x ^ or (Ax|V)™ to be "normal ordered", meaning that the derivatives 
act only to the right and not on Ax. Analogously in the multidimensional case 
for example V 2 = A creates the Hessian matrix. This yields, 

M n = (g(x)) x = (g(x ) + Axg^ixo) + g (2 \x ) + • • j (203) 

= (e^ x ^g(x )) x = g(x Q ) + (Ax) x g^(x Q ) + V 2 )(x ) + ■■■. (204) 

We have seen that an expansion in moments or cumulants depends on the 
the choice of 70 or, equivalently, on the splitting of the x -dependent terms in one 
term —f3E(x) which defines an expectation (• • •) and a field term 7g(x). Thus, 
there is a practically important freedom in choosing different moment or cumulant 
expansions as approximations for the same exponential. Assume for example that 
moments for p'(x) oc e~@ E ^ can be easily calculated. For — f3E = —P'E'+ r yg(x) 
we can then write Eq.( 196 ) for an expectation (• • ■) of a function q(x) under 

p(x) oc e -^W 

, Z' (qe^Y (0(1+70 + •••))' 
(q(x)) x = (qe-y% ^ = ^— ^ = W 19 /* , (205) 

with Z' = Jdx e~P' E '( x ^ and (• • •)' = / dxp'(x) ■ ■ ■. By expansion around xo this 
can also be expressed in terms of moments of Ax 

X +••■ 



e 79(*0) ( q ( Xo) + {Ax y Ul) ( X(J ) + g( Xo ) 75 (l) (xq) 

(q(x)) = i , X ' M V r (206) 

e 7^o)^i + (Ax)' 7 o( 1 )(x ) + - 
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where the prefactor cancels. This is the basis of saddle point approximation 
(or loop expansion) which will be discussed in Sectio n |B.5| a nd also the basis of 
importance sampling in Monte Carlo calculations PJ9|, |21| , [48[ . 

To obtain equations which go beyond a cumulant expansion it is useful to 
consider the Legendre transform of C(7). The Legendre transform (or effective 
action) T(<f>) of C(-y) is defined by requiring p9|,p5|j75|] 



r(</>) + c( 7 )-E^ 



(207) 



to be stationary with respect to variations of fields ji (coupling to gi) at <fi fixed. 
(We may remark here that C{^) and thus its Legendre transform T depends on 
the choice of gi. A typical case is = X{.) This means 



dew 



(g) xh = C il7 , 



(208) 



using also that C is the cumulant generating function. Using the chain rule one 
finds by differentiating functional (p07[) with respect to 4>i 



ar(0) 



and thus with Eq.(|20 



dr(4>) 



(209) 



(210) 



Now set 



with Ad = C iy -Ci 



1 



r ^) = E- E r <1 ,.., in ACi 1 ---AC iB) (211) 

n=l lx,— ,i n 

^i — Ci. This defines the (proper) vertex functions Fii,—,i n 



for which 



7=0 



"ln-1 



(212) 



7=0 



using Eq.( |210[) an d noting that AC = if 7 = 0. Inverting the multidimensional 
version of Eq( |198| ) (setting 71 = 0) to obtain 7 i(<^>) in terms of ACj, and defining 
amputated correlation functions 



one finds 



1^=0, 
r ■ — c~ l 

J-21,12 21,12' 



(213) 



(214) 
(215) 
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^11,12,13,14 = ~-^ii,i2,i3M y t (-^«i,i2,jCi,i-^i,i3,M 

is, iCi J Aj 14 ~\~ ■Ai lj i 4 ^CijA.j t i2 : i 3 ) , (217) 

These equations can easily be inverted 

C< 1 A=rr i i a , (218) 

^1,12,13,14 = _ 1"«1,«2,«3,«4 + (r«l,«2,«C«j'ri,«3,«4 

+r«i,i3,i^'i j-^i,i2,M ~l~ r^^^Cj jTj, 42,43) ' (220) 

Often it is only possible to find (full or perturbed) vertex functions ... t i n by 
expanding around known vertex functions ... j for a solvable (reference or 
unperturbed) system. For example, Eq.( |215D can be reformulated in terms of the 
"self energy" being the difference between perturbed and unperturbed two point 
vertex functions 

(s) il)i2 = r il)i2 -r? )i2 . (221) 

This results in 

Ch,i 2 = Ci lji2 - ^ C ilt j (X)j jfe Ck,i 2 , (222) 

which expresses the full connected two point correlation function C il5 j 2 in terms of 
an unperturbed Cf i . An approximation (£) for the self energy can be obtained 
for example by perturbation theory with respect to the unperturbed reference 
system. Then a corresponding self-consistent solution Ci lji2 can be found by 
iteration according to 

CiiM = Ci 1: i 2 —'^C^j^j^C®^ + (7°j(S) J , i fcC'° ! z(Il)i im C'^ ii2 ). (223) 

j,k j,k,l,m 



The following two Tables show generating functions for zero field 
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generating function 


derivatives 


Z( 7 ) = Jdx e -PE(x)+jg(x) 


= ZMJg) 


lnZ( 7 ) = hifdxe-PEW+tsW 


a 7 n v 


) „ = cm 

7=0 


M( 7 ) = Z( g ] = (ewW) 

\ ' X 




n = M n ( 5 ) 

7=0 


C( 7 ) = lnM( 7 ) = ln(e^W\ 

\ ' X 




n = C n (g) 

7=0 



and for nonzero field [3 or 7 o, respectively 



generating function 


derivatives 


Z = fdxe-P E W 


a(-/?')" Z(/5 


) ^ = ^M n (£) 


InZ = In jdx e-^W 




Z{i) = Jdx e -m^)+79(x) 




= Z( 7o )M„ i7o (( ? ) 

7=70 


lnZ( 7 ) = In/dxe-^W+TffW 


a 7 n w; 


— Cn,7o(5) 

7=70 


M( 7 ) = Z ^ 7) = (e^)\ 

\ ' X 


a 7 » M(7) 


= M( 70 )M ni70 (<7) 

7=70 


C( 7 ) = lnM( 7 ) = ln( e Ts(*)\ 




— Cn,7o (5) 

'=70 


M 7o (7) = |g) = ( e{7 " 7o)9W >^ |7o 


5 7 n M 7o(7) 


7=70 


C 70 ( 7 ) =lnM 70 ( 7 ) =ln(e(T-^)) 


9r C 7o(7) 


= Cn,7i (fi 1 ) 

7=71 
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A. 4- Conditional probabilities and disordered systems 

We already discussed that it is often useful to look for conditions under which 
the energy is easy to specify and to combine the different possible conditions in a 
second step to obtain the complete probability. Thus, a joint probability p(x, y) 
can be specified either by a 'joint' (or annealed) energy function E(x,y) with 
conjugated joint (or annealed) temperature l/[3 or a conditional (or quenched) 
energy E(x\y) with conjugated conditional (or quenched) temperatures B(y) and 
a mixture energy E(y) with mixture temperature b. Hence, 

e -f3E(x,y) e -bE(y) e -B{y)E(x\y) 

p{x > y) = -z{^y) = P{y)p{xly) = z(y)z(x\ y ) ' (224) 

with 

p(y) = e- bE ^/Z(y), (225) 

p( x \ y ) = e - B ^ E ^/Z(X\y), (226) 

and 

Z{X,y) = Jdx Jdye~ pE ^ y \ (227) 

Z(y) = Jdye- bE ( y l (228) 

Z(X\y) = Jdxe- B{y)E ^ y) . (229) 

One may remark, that choosing —(3E{x,y) = —bE(y) — B{y)E{x\y) produces a 
joint probability 

e -PE(x,y) e -bE(y)-B(y)E(x\y) 

P ' {X > V) = Z(X,y) = fdye-w(v)Z(X\ y y im 

which is different from p(x, y) of Eq.( |224| ). 

If interested only in variable x one integrates (marginalizes) over y. Working 
with joint energies this gives 

p(x) = jdyp(x,y) = l ( ^ y) = w , (231) 
whereas working with conditional energies yields 

, f e~ bE(y) e -B(y)E(x\y) , e -B(y)E(x\y) 

p{x) = J dyp(x, y) = Jdy z{y)z{m = J dy P (y) ■ (232) 
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In the formulation with joint probabilities E{x, y) is the canonical variable. Its 
averages can be obtained by differentiation with respect to (5 

E ann = E(X) = (E(x)) x = (E(x,y)) xy = -^\nZ{X). (233) 

In the formulation with conditional probabilities the expectation of E(x\y) can 
be obtained by differentiation with respect to B(y) 

E quen = E{X\y) = (E(X\y)) y = {E{x\y)) xy =-(^--\nZ{X\y)J . (234) 
Even for y-independent B{y) = B the averaging of \n.Z(X\y) remains 

^ x \y)) x , y = -^^ z ( x \y))y ( 235 ) 

For y-independent normalization Z{X\y) = Z(X) both approaches are equivalent 
and (In Z(X\y)) = InZ(X). In general, however, the expectations E ann and 
Equen are different. Also, despite of the equality (|224[) , the exponential families 
defined by varying the parameters (3 or B (or B(y)) are not the same. The 
conditional temperatures or fields B(y) do not influence the distribution p(y), 
while the joint temperature 1//3 does. 

In practice, for example, it may take some time after changing temperature 
or an external field until a stationary distribution p{x, y) is reached. Assume the 
dynamic of y being much slower than that of x (e.g. lower energy barriers for x 
and higher energy barriers for y). In a magnetic substance x may stand for fast 
adapting local spins and y for very slowly moving impurities. Then changing the 
physical temperature will on short time scales (or low temperatures) only change 
the distribution of the fast adapting spins x, while the slow impurities y remain 
quenched. Then the relevant physical temperature is the conditioned or quenched 
field B(x). On a much longer time scale (or at high enough temperatures, i.e., in 
an 'annealed system') also the impurities will approach an equilibrium distribu- 
tion. Then the physical temperature is a joint or annealed field (3. In addition, 
ensemble averages for variables which reach a stationary distribution on short 
enough time scales are often measured as time averages under a (quasi-)ergodic 
dynamic. In contrast, averages over slow variables can in practice not be obtained 
as time averages and must be realized as ensemble averages. 

Eq.( |235| ) requires the calculation of a partition sum Z(y,x) for every x. 
This is possible for a small number of different x values or if the x-dependence 
can be calculated analytically and the average over X be performed. In general, 
however, Eq.( |235| ) is rather difficult to solve. One possibility to proceed is using 
the identity 

Z n — 1 

lnZ=lim , (236) 

n— >0 n 
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which is verified by expanding 

Z n = e nlnZ = l + nlnZ + ^^-(lnZ) fc . (237) 



n 



k\ 

k=2 



Typically, the average over Z n for integer n is easier to perform than over InZ. 
Because Z n describes a system with n independent 'replicas' of the same system 
this approach is known under the name replica method |^4,18,27,^]. Performing 



the average over x, however, results usually in a coupling between the different 
replicas. Also one has to be careful, because calculating Z n for integer n does 
not uniquely determine the analytic continuation to n — > 0. For non-interacting 
disordered systems a super symmetric approach (expressing the two-point corre- 
lation function as a product of two gaussian integrals, one over commuting vari- 
ables and another over anticommuting (Grassmann) variables) can avoid such 
difficulties of the replica approach ( [0,^] ) . 



B. Bayesian decision theory 

B.I. Basic definitions 

B.I.I. The basic model 

Consider the following scenario^]. We assume that we can prepare a spe- 
cific situation x G X and measure outcome y G y. Furthermore, we assume 
that the probability p(y\x, h) of outcome y is determined by x and additional 
variables h which we cannot observe directly. These additional variables h will 
be called collectively 'state of Nature'. Furthermore, we assume all knowledge 
/ we have accumulated about Nature in the past has been combined in form 
of a probability density p(h\f) over the possible states of Nature h G Ti. The 
aim of learning is to update our knowledge about Nature p(h\f) — ► p(h\f'(D, /)) 
as more data D arrive under the assumption that the underlying 'true state of 
Nature' hjy producing the data does not change. 

Hence, to define the basic model formally we split the random variables of 
interest into the two groups of 

1. hidden (not directly measurable) variables h G Ti {model states, possible state 
of Nature), assuming the true state of Nature h^ is in TC, and of 

2. visible (directly measurable) variables consisting of (potential) data {x, y} and 
state of knowledge f, where 

a. the vector collects all independent variables (independent of h, may 
also be called questions, measurement devices, conditions/situations of mea- 
surement, measured quantities, observables), 

b. the vector y £ y encompasses all dependent variables (depending on h, may 
also be called answers, measurement results, responses, observed values), 
and 
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\ 

Vk ■ ■ ■ 







Figure 9. Graphical representation of a probabilistic model factorizing according to p(x, y, h\f) 

= P{h\f) H k p{xk)p{yk\x k ,h) . 

c. the state of knowledge / E T includes all determining variables (determin- 
ing h, i.e., parameterizing the probability p{h) of the /i-producing process 
and describing thus the situation under study). 

Thus, a state of Nature or model state h is described by specifying its data 
generation densities (x-conditional y-likelihoods of h) p(y\x,h). All together, 
the joint probability of the basic model factorizes according to 

P(x,y,h\f) =p(h\f)p(x\h)p(y\x,h) = p(h\f)p(x)p(y\x,h). (238) 

The variables x, y may be vectors of i.i.d. sampled (vector) variables. Repeated 
independent sampling under constant h, i.e., 

n 

v{x) = \\p{xi), p{y\x, h) = X\p{yi\xi, h), (239) 

i i 

where contain (components of) xo, gives for the example of a discrete set 

of Xi £ X 

p(x,y,h) =p(h\f) ]]_p(xi)p(yi\xi,h). (240) 

i 

Fig.|| shows a graphical representation of that model as a directed acyclic graph 

B.1.2. Learning: predictive and posterior density 

By learning we mean the change in state of knowledge / — ► f'(D, f) as new 
data D arrive. We will distinguish potentially interesting and actually known 
data: 

1. relevant or test data Dr = (x R ,y R ), x R £ Xr, y R 6 correspond to potential 
(future) application situations of interest, being thus the data we are actually 
interested in and which we want to predict, and 
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Figure 10. Typical relation between relevant data Dr, training data Dt C Dr, prior data Do 

and available data D — Dt U Do . 

2. available data D contribute to our state of knowledge about Nature. For 
practical purposes those may be further divided in 

a. training data Dt = {Dx,i\l < i < n} = {(xj,yj)|l < i < n} = (x T ,y r ), 
being an empirical sample of Dpi, i.e., a finite number of pairs (xi,yi) drawn 
i.i.d. according to p(xi)p(yi\xi, h^) for relevant Xi S Xr under the true state 
of Nature , and 

b. prior data Dq = {(xq, yo), f, S} collecting all other available knowledge (a 
priori information) not contained in the training data. Prior data can appear 
as 

i. measured prior, corresponding to measured data (xo,yo) n °t considered 
as training data, as 

ii. generative prior f, (preparation control) corresponding to knowledge 
about the (probabilistic) preparation process which generates the true 
state of Nature hjy, or as 

iii. structural prior S (model control) refering to all knowledge concerning 
the specified dependency structure of the model variables. 



Fig. 10 summarizes the relations between the different data types. 

Being interested in the relevant data the aim of learning is to find the pre- 
dictive density p(y R \x R , f'(D, /)), or more shortly, p(y R \x R ,D), after receiving 
training and prior data. Inserting the hidden variables h the predictive density 
becomes 

P{y R \x R ,f')p(y R \x R ,D) = Jdhp(h\f(D))p(y R \x R ,h). (241) 

Thus, the space T of possible states of knowledge is the convex hull of the the 
space H of possible states of Nature. The essential ingredient to be calculated 
in Eq.( |24l| ) is the posterior density p(h\f'(D, /)), or more shortly p(h\D), which 
can be obtained by obtained by inverting the model Q240[) using Bayes' rule 



p(h\D) = P(frl*°;*y) , (2 42) 
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Figure 11. The setting of Bayesian decision theory in a graphical model. Circles indicate known 
variables. In this figure the variable f x determining the probability p(x) is shown explicitly. f x 

is implicit in the other figures. 

B.I. 3. The risk functional 

Next we consider a set of possible actions a(x) G A from which we can choose 
in situation x before having seen y. The action a(x) can for example be the value 
we expect for y (regression) or a complete density p(y\x,a) (density estimation) 
we expect for y under x. Furthermore, assume we suffer loss l(x,y, a(x)) if y 
appears after we have chosen a(x). Common loss functions are for example log- 
loss l(x,y,a) = — lnp(y\x, a) for density estimation or mean square loss (y — 
a(x)) 2 , absolute loss \y — a(x)\, or 5{y — a{x)) for regression ||. 

The decision model we consider has a graphical representation shown in 



Fig.ll, where actions and loss are deterministic variables 



and 



p(l\x,y,a) = 5(1 - l(x,y,a)), 



p(a\x) = 5(a — a(x)). 



(243) 



(244) 



Decision theory aims in minimizing a functional of the loss posterior p(l\a, /). 
The most common functional chosen to be minimized is the expected risk (ex- 
pected loss) 

r («> f)= dl l(x, y, a) p(l\a, f) 



dx j dyp(x)p(y\x,f)l(x,y,a) 

dh dx dyp(x) p(y\x, h) p(h\f) l(x, y, a). (245) 



B.2. Interpretation of priors 
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B.2.1. Measured and factorial priors 

The state of knowledge before evidence has been received for data D appears 
as visible variable. This visible variable, however, must also be based on some 
information which can be considered being data. One may wish, therefore, to 
express a state of knowledge by giving explicitly the data it is based on. To do 
this, one has to include prior data in form of a measured prior (xo,yo)- Such a 
measured prior represents a situation where value yo have been found for h as 
result of a (probabilistic) measurement of property xq (e.g., smoothness). Thus, 
model (|240[) becomes 

p(x,y,h) =p(h\f)p(x )p(y \x ,h) Y[p(xi) p(yi\xu h) 

i=l 

= p(h\f)Hp(x i )p(y i \x i ,h). (246) 



Its graphical representation is shown in Fig.|12, 

Even, however, if measured priors are included the variable / (character- 
izing now a lower level generative prior) remains in the model. The question 
therefore arises what kind of / should be chosen as a natural starting point of 
learning. Consider therefore (as generative prior) a factorial prior, for which 
p(h\f) factorizes with respect to relevant data, 

P(h\f) = l[p(h k \f k ). (247) 

k=l 

The model, shown in Fig.|l3|, becomes 

p(x,y,/t|/(factorial)) = p(x p(y \x ,h) ]]_p{h k \f k )p(x k )p(y k \x k ,h k ). (248) 

k=l 

Without prior term p(yo\xo, h) this corresponds to a diagram where only variables 
with the same index i are connected, which means that receiving data for x\ would 
not allow any generalization to relevant data Djjj, 

p(,Vj\ x j, f(. D i^j)) = P(Vj\xj, /(factorial)). (249) 

A factorial prior is therefore a natural choice for a formal starting point of learning 
as it contains no information which allows generalization from training to non- 
training data. Under a factorial prior it is therefore essential for generalization 
that the value yo of prior xq depends on all hi. 

Interestingly, the asymptotic end points of learning, i.e., the pure model 
states h, also represent factorial states, 

p(y\x, h) = ~\\p{yi\xi, h), (250) 



because the variables (xi,yi) for different i are independent conditioned on h. 
Hence, in this picture learning starts and ends asymptotically in a factorial state 
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Figure 12. Graphical representation of a model with measured and with generative prior 
p{x,y,h\f) = p(h\f) p(x )p(y \x ,h) Y[ k=1 p(xk)p(yk\xk, h) . Hereby x , yo may also factorize 
into independent components. Circles indicate measured variables, i.e., the available data D. 
The predictive density for relevant or test data Dr reads p(y R \x R , f'(D, /)) = J dhp(y R \x R , h) 
p(h\D) and requires calculation of the posterior density p(h\D) oc p(y D \x D , h)p(h). The figure 
on the right shows the situation after learning where data D has been used to obtain the new 

state of knowledge /'(/, D). 



training test prior 




-t- -H 

hi ■■■ hk \ 




Figure 13. Graphical representation of a factorial prior (with respect to relevant or test 
data) p(x,y,h|/(factorial)) = p(x )p(y \x , h) ]J k=1 p(h k \f k )p(x k )p(y k \x k ,h k ) . Without y 
depending on h k no generalization is possible from training data Dijt k to test data D k . 

of knowledge. The distance from a factorial state could be measured by the 



(x-averaged) mutual information p8,15 



Mi,)= hh MxJ) mMry (251) 

Thus, starting at zero the mutual information would be expected to increase at 
the beginning of learning and to approach finally zero again asymptotically. The 
final factorial state could again be starting point of learning under a new, finer 
model H' . 
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B.2.2. Gaussian likelihoods 

A model state h is a shorthand notation for a parameter vector £ specify- 
ing the data generating densities p(y\x, h) = p(y\x, £). In this paper we mainly 
consider gaussian regression problems for which relevant and training data are 
produced by gaussians with mean specified by h and h- and x-independent vari- 
ance. For example, for one-dimensional y 

1 (y-h(x)) 2 

V{y\x,h) = -j==e (252) 

Hence, in this case h is parameterized by the regression function £(x) = h(x). 
In general density estimation problems one use a parameter function £(x,y) = 
p(y\x, h) under the additional normalization constraint J dy x) = J dyp(y\x, h) 
= 1, Vx, /i. Thus, in this case the y-likelihood of h is not gaussian in £. 

The integration / ci/i to obtain the predictive density stands for the inte- 
gration J rj m d£ m over all components £ m of parameter vector £. In case /i is 
parameterized by a regression function £ m — > as in Eq. (|252j ) the integral 
reads 

[l[dZm-> f Y[dh(x). (253) 

m x 

As far as well defined, this becomes for continuous x variable a functional integral. 
Similarly, for general density estimation problems 

[U^rn^ [Y[6(fdy'p(y'\x 1 h)-l)Y[dp(y\x,h). (254) 

m x y 

Prior data have to depend on all h{x) to allow for generalization. Analo- 
gously to the gaussian y-likelihood of ( |255| ), we will in this paper mainly consider 
gaussian prior information, e.g., for one-dimensional regression problems 



i -o(h-yo 



K(x ) 



h-yo 



p(y Q \x ,h) = (27T)-2(detK)2e 2 \ 'V, (255) 

for symmetric, positive (semi-) definite K and 



d d 



(h-y Q \K{x Q )\h-y Q ) = Y J T.^)-yo{x))K{x,x'){h(x , )-y {x 1 )). (256) 

x=l x'=l 

An infinite normalization factor in Eq. (|255| ) appearing for d — > oo will cancel 
when calculating expectations. We will use therefore the convenient integral 
notation J2 X ^-x — ¥ Jdx in the following. The variable x$ determines the kind of 
measurement for which yo is regarded as output. In the gaussian case fl255|) xq 
defines the operator K and thus the covariance K' 1 . To measure smoothness, 
for example, one may choose the laplacian K = A. The choice K(x) = \x >< x\ 
yields the standard gaussian training or test data of Eq. (regression) . 
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Figure 14. Graphical representation ol a generative prior model p(x, y, h\Do) = p(h\xo,yo) 
rifc-i P( x k)p(yk\xk, h). Here, a priori information enters as posterior density of a /i-generating 
process. The variables Qh = (xo,yo) are parameters of the state generating process. For gaus- 
sians, for example, they may determine mean and covariance structure. The posterior density 
of h under data D becomes p(h\D) oc p(y T \x T ,h)p(h\Do). 

B.2.3. Generative priors 

If the process generating is under explicit control it is natural to model 
prior data as generative prior. For example, consider a situation where h is 
produced by a gaussian density 



p(fo\xo,y°) = (27r)-HdetK)h~ h(h ^ h ' 



K(x ) 



h-yo 



(257) 

with mean yo and covariance K~ l under control. Then the corresponding gener- 
ative prior model is shown in Fig.|l4|, 

B.2.4- More complex models 

The basic model of Section ( B.l.l] ) is completely general. It can however be 
useful to implement additional structure. Input noise, for example, corresponds 
to a decomposition 



p(y\x,h) = / d6 x p(y\9 x ,h)p(9 x \x) 



(258) 

For a finite number of discrete 9 X and gaussian components p{y\9 x ,h) this con- 
stitutes for p(y\x, h) a mixture of gaussians with varying mean. Similarly, output 
noise 

p(y\x, h) = J d9 y p(y\9 y )p(9 y \x, h), (259) 

may be used to construct gaussian mixtures with varying covariances. Analo- 
gously, generation noise 



p(h\f)= d9 hP (h\9 h )p(9 h \f), 



(260) 
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Figure 15. Graphical representation of a model including besides h additional hidden 
variables p(x,y,h) = p(x,y,h,8 Xo ,6 V0 ,6 h ) = p{6 h )p(h\6 h ) p(x )p(y \9 X0 ) p(9 yo \9 xg , h)p(9 X0 \x ) 
rifc=i P( x k)p(yk\xk, h). The additional hidden variables 9 = {9 xo ,9 yo ,9h) parameterize input, 
output, and generation noise. The posterior density of h requires marginalization (integra- 
tion) over 9 and becomes p(h\D) oc J d9 X0 d9 yo d9h p^la^,^) p(yo\9 yo ) p(9 yo \9 xo ,h) p(9 xo \xo) 

p(h\9 f0 ) p(9 f0 ) . 



can be used to obtain for example a mixture of gaussians with varying mean and 
covariance. 

The variables 8 = {6 X) 6 y ,6h) represent additional hidden variables of the 
model. Restricting for convenience the variables denoted by h to that hidden 
variables which determine the relevant data (relevant state of Nature), an (ex- 
tended/complete) state of nature h is given by specifying h = (h, 0). A graphical 
representation of a model with additional noise variables 9 = {6 X , 9 y , Oh) is shown 
in Fig. 15. 



B.2.5. Structural priors and a posteriori control 

We have up to now discussed how prior data can enter a model in the form of 
empirically measured data or as generative prior controlling the preparation of /. 
A third possibility consists in the direct specification of the necessary dependency 
relations between relevant and training data within the model. 

The empirical validity a a priori information implemented by a specific model 
is based the possibility to control the required dependency structure between 
relevant and training data (model control). As those dependencies have to be 
controled at the time of testing (which is usually after having received training 
data) the validity of a priori information can be ensured by a posteriori control. 
As a trivial example consider a bound y R < c which can be enforced by ignoring 
values larger than c at the time of testing. 

Thus, it is useful to note that a device producing y with probability density 
p(y\x,h) has a 'passive' interpretation as measurement device or an 'active' in- 
terpretation as control device. In the passive interpretation the device is said to 
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measure the property x of Nature h, in the active interpretation y is said to be 
produced or controled by preparing x, but modified by unknown parameters h. 
The active interpretation is evident if the probability density of y is determined 
by a function g of x and y only 

p(y\x,h) = g(x,y), 

which in the extreme case may even be deterministic 

p(y\x,h) = 5(y - g(x)) 

Next, consider a control device 

p(y\x,h) =g(x,y,£) (263) 

depending on some parameters £ which are unknown but fixed for all x. Thus, the 
parameters £ represent the unknown state of Nature h and the control device can 
be said to be a (indirect) measurement device for £. Such a situation is shown 
in Figjl^. Because hereby prior information is implemented by choosing a model 
with specific structure this may be called a structural prior. 

The number of parameters £ may be very large. Consider, for example, a 
control device producing y according to a gaussian density with known variance 
and mean similar but probably not equal to t{x). Assuming now the true but 
unknown mean = h{x) is an empirical realization of a gaussian process 
with mean t(x) and covariance K~ l would result in an quadratic error term 
(1/2) < h — t\K\h — t >. Thus, in this interpretation the template function 
t(x) characterizes an average control device. We have already discussed that 
an approximative control device t(x) is often provided by human experts. For 
example they may specify what constituents define an image to be a face. Looking 
for h(x) in the neighborhood of t{x) approximate control devices t(x) can than 
be refined by training data. 

Summarizing, a measurement device can be seen as control device with un- 
known but constant parameters with a priori information specifying the amount 
of control or knowledge we have about that device. 

B.3. Quantum Mechanics 

B.3.1. Density operators 

The formalism can also be applied to quantum mechanical problems and 
can be used to solve inverse quantum mechanical problems, i.e., problems where 
empirical, (e.g., scattering) data are used to determine the Hamiltonian of a 
system. In quantum mechanics a system is specified by a density operator p 
describing the state of the system. An observable x is represented by a hermitian 
operator and its eigenvalues are the possible measurements results y. As a 

3 We used Oa; instead of x to denote the operator representing a general observable, because x 
is used in quantum mechanics to denote the multiplication operator in coordinate space. 
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training test 




Figure 16. Graphical representation ol a structural prior model p(x,y,h) = p{x T ) p(x R ) 
PitfrW,h) p(y R \x R ,h) l\ n p(h„\f n ) with posterior p(h\D) oc p(ft)p(y r \x r , h) . 

measurement in quantum mechanics changes p, repeated measurement under the 
same p requires a new preparation of p before each measurement. 

In particular, the probability of measuring value y for observable x repre- 
sented by an hermitian operator O x under density operator p is given by 

n y 

p(y\x,h) = p(y\O x , p) = Tv(P yP ) = J2 < Vj\ PlVj >, (264) 

j 

where and 

fly 

p» = Eto><i/ii ( 265 ) 
j 

is the projector ( with \yj >< y,j\ denoting the dyadic product ) into the sub- 
space of (orthonormalized) eigenfunctions yj of the (hermitian) operator O x with 
eigenvalue y 

O x \Vj >= y\Vj >, (vM) = SjiS(y - y'). (266) 

The indices distinguish n y different eigenvectors with equal eigenvalues y. For 
nondegenerate eigenvalues n y = 1. The density operator characterizing the sys- 
tem is a hermitian, positive definite operator with eigenfunction decomposition 
(the sum to be replaced by an integral for non discrete cases) 

p = ^2p(i) \<Pi X <Pi\ (267) 

i 

with < pi < 1 and X^pW = 1 an d orthonormal (ipi\ipj) = Sij. In case the 
<Pi are the eigenstates yi of O x the p(i) become p(y\O x , p). A density operator 
which is a projector, 

p 2 = p= \<p><<p\, (268) 
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consists of only one term and is called a pure (quantum mechanical) state. For a 
pure state 

n y 

p(y\x,h) = Tr(j2\Vj >< Vjlf >< Pi 



= E (v\vi) (yM = E I (vM I 2 = E l^')l 2 - (269) 

j j j 

Pure states already show all (non classical) quantum phenomena while statistical 
mixtures with p 2 ^ p just add a classical (non quantum mechanical) averaging 
to quantum mechanical systems. 

Consider a system with p(to) prepared at time to- To calculate the density 
operator p(t) of that system at a later time t > to of measurement it is necessary 
to study the time dependence of the system. The time development of a quantum 
mechanical system is given by a unitary evolution operator U (unitary means U _1 
= with denoting the hermitian conjugate of U) 

\<p(t)>=V(t,to)\<p(to)>, (270) 

leading for density operators to 

p(t)=U(t,toMfo)Ut(Mo). (271) 

The evolution operator is usually expressed by a Hamiltonian operator H 

U(t,t ) = e - i(t - to)H , (272) 

for time-independent Hamiltonian H (and disregarding a factor K) or (at least 
formally) by 

U(t,to) = Te-< Hit)dt , (273) 

for time-dependent Hamiltonian H(t) with T denoting the time-ordering opera- 
tor (see for example |2S.4£.7j| ). 

B.3.2. Quantum statistics 

Stationary density operators 

p = J2Pij\ E ij>< E ij\ (274) 

ij 

with p{t) = p(to) are built from (orthonormalized) eigenstates \Eij> of the Hamil- 
tonian 

H|E y >=^|£^>, (275) 
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the index j distinguishing eigenstates with equal eigenvalue. For example, a 
canonical ensemble at temperature 1/(5 for a Hamiltonian H(£) depending on 
parameters £ is given by the density operator Q 

p(H(0) = — ^- = ^-gHK) = ^ ^ ^ <-' (,) 

yielding 

Here p has been expressed by eigenfunctions of H by expanding the exponential 
and inserting the eigenfunction expansion 

n El 

H = J2J2 E i\ E v ><E v\- ( 278 ) 

i 3 

(Replace the sum by an integral if H has a continuous spectrum.) 

A Bayesian approach now uses the likelihood ( |277| ) to update a given prior 
density p(£) to a new posterior density p(£\D) after new data D became available. 
Because the measurement of a quantum mechanical system changes p, repeated 
data for the same p requires the repeated preparation of the canonical ensemble 
before each measurement. If learning about £ is intended the canonical ensem- 
bles must hereby correspond to fixed (but unknown) £. This can simply mean 
waiting long enough between two measurements until a given system with fixed 
but unknown hamiltonian (described by £) is again in thermal equilibrium. 



B.3.3. Quantum mechanical scattering 

As a second example we prepare a pure state \z(to)> at time to with z\z>= 
z\z>. This corresponds to the density operator at time t 

p{t) = \z{t)xz(t)\ = U(t, t )\z(to)><z{t )\U^{t, t ). (279) 

Measuring then at time t a non degenerated eigenvalue y of observable y has 
probability 

p(y\x,h)=p(y\O x ,p(t)) = \z(t,y)\ 2 =< y{t)\z{t)><z(t)\y(t) > 

= | (y(t)\z(t)) | 2 = | (y(t)\U(t, to)z(t )) | 2 . (280) 

In scattering theory one takes the limit t — > oo and to ~~ > ~ °° an d assumes the 
asymptotic states lim^oo y(t) and limt ^_oo z(to) converge (in the weak sense) 
to ('free) states y° or z , respectively, fulfilling 

H in |zJ >= z°\z° >, BUIi/J >= y°||/} > . (281) 
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(We will skip the degeneration index j assuming there exist non-degenerate (i.e., 
unique) states \z° > and < In case the eigenvalues z° or y° are degenerated 
the states can be made unique by measuring additional commuting observables 
commuting also with Hj n or H ou t, respectively. For non- unique states, see below.) 
One obtains 

p{y\x,h) = lim p{y\O x ,p(t)) = lim | (y(t)\z(t)) | 2 = lim | (y(t)\V(t,t )z{t )} | 2 , 

t— >CO t— >OC t—>oo 

tQ — y — oo 

(282) 

or, inserting the free in (initial, preparation) states \z°> and out (final, measured) 
states <y°|, 



lim p(y\O x ,p(t,z(to)))= (y»Sz°(0) 



(283) 



which defines the scattering operator 

S= lim uL(t,0)U(t,t )U m (t ,0), 



(284) 



with (for time-independent Hj n , H 



out ) 



-i(t-to)H in 



U ou t(t,t ) = e 



-i(t-t )H ov 



(285) 



Often only a partial measurement is performed which does not allow to identify 
a unique final state y° (or y). Then there is a set A of y° which can not be 
distinguished by the measurement. Also, often the preparation is not a pure 
state but a mixture of states z° G B (or z G B, possibly also with varying 
preparation observables z or Hj„) with probability p(z°). In that case one has to 
sum over out states in A and average over in states in B 

y&A heB 



= E E^>(y|cw)= E E*°) 

y°&Az°£B y°eAz°eB 

Eq. [286| links quantum mechanics to the Bayesian framework and allows to deter- 
mine a system Hamiltonian H(£) from scattering experiments given a prior p(^) 
over parameters £. 



y°(0) S(0*°(0) 



(286) 



B.3.4- Disordered systems 

A general density operator p may be constructed as a mixture of compo- 
nents p(H(£)) which are stationary with respect different H(£). Consider, for 
example, a situation where the preparation of a system with stationary density 
with respect to a constant (but possibly unknown) H(£) is not possible before 
each measurement. Assuming that at least a constant (but possibly unknown) 
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p(h)= jd#p(0\h)pQ*(0)), 

SO that (see Section p.2.4| ) 

p(y\x,h) = / cM p(y\x,h, , &) = 



mixture can be prepared density operators representing states of Nature p(h) 
have the form 

(287) 

d&p(0\h)p(y\O x ,p(H(&)). (288) 

There are many cases where it is easier to specify a system in a conditional (disor- 
dered) form ( |287| ) than to give directly the joint density p(y, h). However, as 
discussed in Section |A.4| , technical complications arise because Eq.( |287| ) requires 
calculation of possibly ^-dependent normalizations Z$ of p(H(i?)) for all In 
such situations where the likelihood is defined as a mixture of conditional den- 
sities p(y\x, h, i?) and we want to emphazise the need to deal with -^-dependent 
Z$ we will also speak of a disordered system. Note however, that written in its 
eigenbasis the density operator for disordered systems takes again the form of 
Eq.( |267j ), and vice versa a general p of that form is a mixture of components 
\<Pi >< (fi\. If such a formulation has been found, however, one already has Z% = 
Tr(|y>j >< <fi\) = 1 for orthonormalized tfi. 

We remark that most problems studied in textbooks of quantum mechanics 
(or, analogously, of quantum field theory) assume a given p and aim in calculating 
p(y\O x , p). Hereby p can, for example, be a pure (quantum mechanical) station- 
ary state (bound state problems), a pure (quantum mechanical) non-stationary 
state (scattering with completely determined initial state), a stationary mixture 
(e.g., a system at finite temperature), a mixture of conditional p(#) (disordered 
system), or, equivalently but differently specified, a general non-stationary mix- 
ture (e.g., scattering with not completely observed initial states). For such prob- 
lems with fixed p no learning can occur. To allow learning a space of possible p 
together with a prior density p{p) must be specified which is updated to obtain 
a posterior p(p\D) after new data D have been received. The following table 
shows possible forms of density operators with y>i(H) denoting orthonormalized 
eigenfunctions of H 





P 


stationary pure state 
general pure state 
stationary mixture state 
disordered mixture state 

general mixture state 


|¥>(H) X p(H)| 
\<p >< <p\ 
EiP(i|H) MH) >< ^(H)| 

EiPO') \<pi >< <Pi\ 



B.3.5. Path integrals 

The path integral approach provides an alternative to the operator formalism 
for quantum mechanics or quantum field theory, respectively [ 25, 58, 75, 46 1. For 
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example, a density operator of a canonical ensemble ( B7q ) can be expressed as a 
path or functional integral 

<y\P\y>=^Jdn fd ( f>MyWy)e^ dT f^( i ^- H ^), (289) 

with 



dir # e/oW^-§7^(^)). (290) 

J(j> =(l>p 

Hereby H (ir, (ft) is a classical function depending on classical fields (ft and tt, and 
JdTrd(ft denotes a functional integral over functions tt(x,t) and (ft(x,r). The 
function H corresponds to a Hamiltonian 

H= fdxH(Tt,4>), (291) 



expressed in terms of field operators (ft(x, r) and their canonical conjugates 
7t(x,t). For H which are quadratic in tt the 7r-integral is gaussian and can 
be performed analytically. For more details see for example |33| . 

Calculating the functional integral in saddle point approximation yields the 
classical field equations. Such a saddle point approximation can be combined 
with the saddle point approximation for the /i-integral which will be discussed 



in Section |B.5| |65] 



B.4- B ayes' rule for complete data 

B.4-1- Posterior probabilities and likelihoods 

Typically, model states h are defined by giving their data generating proba- 
bilities or likelihoods p{D\h). The posterior probabilities p(h\f) = p(h\D) we are 
interested in are related to the likelihoods p(D\h) by Bayes' rule 

P (h\ f) - p(h\D) - P^ h ^ - pMQpW (2Q2) 
pwn pwv)- p{D) - j dhp{Dlh)p{h y i^j 

Restricting this Bayesian inversion for the moment to training data one finds 

(h \ D \ = P( D T\h)p(h\D ) UiP(yi\xi,h)p(h\D ) 
P{ 1 ' jdhp(D T \h)p(h\D ) IdhUiP(yi\xi,h)p(h\D o y [ ' 

where the training data term has been written as product over all training data 
D T = {x T ,y T ) = {(xi,yi)\l <i <n} (x T , yj, denoting the vectors of x i} yi), 

p(L>T\h) =p ( | ^ =T\p(yi\xi,h). (294) 
P{x T ) 
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Hence the effective distribution under the posterior state of knowledge / becomes 
the quotient of two correlation functions 

where {g(h)) denotes the prior average fdhp(h\Do)g(h). 

In the following prior data Dq will be treated analogously to training 
data Dt- We also assume enough prior data so that all information lead- 
ing to nonuniform p(h\D) is contained in D. Then, p(h) is uniform (possibly 
improper, i.e., non-normalizable) and thus /i-independent, so that p(h,D) = 
p(y D \x D , h)p(h)p(x D ) oc p(y D \x D , h)p(x D ). We call such data D = Dt U Dq com- 
plete. 

The (training and prior data) generating probability or (complete) likelihood 
p(D\h) is related to the posterior probability according to Eq. fl292j ) 

v(h\D) - v(h\v x ) - P(ynK^)p( h K) _ p(y D \x D ,h)p(h\x D ) 
which becomes for complete data or uniform p(h\x D ) 

p(m = viyjx h) 

Jdhp{y D \x D ,h) 

B.4-2. Posterior and likelihood energies 

Let us now introduce posterior energy E(h\f) , posterior temperature 1//3 P , 
with corresponding free energy F(.Fo|/) an d partition sum Z(.Fo|/) to write the 
posterior 

P {h\f) = P (h\D) = |^ = e-^imm-Fwm) , (29 8) 

with 

Z{h\D) = e-Pp E ^ D \ Z(H\D) = Jdhe-^^W = e~^ F W D \ (299) 

Analogously, the Xj-conditional training likelihood p(yi\xi,h) becomes in 
terms of (conditional) training likelihood energy E(yi\xi, h) and (conditional) 
training likelihood temperature 1/Pt 

pte M = e-^( B <*M- F Q>M) = ° ^ . ' , (300) 
with free energy 

F(y\xi,h) = ~z(y\xi,h), (301) 

Pl 
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and normalization factor Z(y\xi, h) over responses yi £ y for given h and Xi 

Z(y\xi,h)= [ d yi e-^ E (Vi\^ h ). (302) 

Jy 

The complete conditional likelihood p(y D \x D , h) reads in terms of (complete, 
conditional) likelihood energy E{y D \x D ,h) and (complete, conditional) likelihood 
temperature l/(3 L 

-f3 L E(Y\x D h) 

p(y D \x D ,h) = e -Wy D \x D ,h)-F(y\ XD ,h)) = ? (303) 

with E(y D \x D , h) = Et + Eq and Et = E r p{y T \ x T , h), Eq = E(yo\xo, h, Dt), free 
energy 

FQ>\x D ,h) = -jZ(y\x D ,h), (304) 
and normalization factor over data y D for given state h 

Z(y\x D ,h)= f dy D e-^ E (y D K' h ). (305) 

Jy 

B.5. Saddle point approximation 

B.5.1. Maximum a posteriori approximation 

An exact analytical solution of the full integral in r(a, /) is most times not 
possible and approximations have to be used. We also remark that for functions 
h(x) of continuous variables x, like in field theory in physics, the integral / dh 
is a functional (or path) integral. Such integrals have typically to be defined by 
perturbation theory, starting from a well defined, e.g. gaussian, case. Alterna- 
tively, the integral can also be discretized, like it is done in lattice field theory 

and like we will do for numerical calculations. 

All approximations can only calculate a finite number of solvable terms. 
A solvable term can correspond to a solvable infinite sum or, e.g., a gaussian, 
integral. Low temperature approximations restrict the evaluation to the most 
important terms (thus they replace integration by maximization), high tempera- 
ture approximations start from a case where all terms are equal (e.g., a cumulant 
expansion starts with the mean), while Monte Carlo integration evaluates a ran- 
dom sample of terms (so that under certain conditions the sampled sum converges 
to the true result). We will discuss in the following the maximum posterior ap- 
proximation, which is a special variant of a saddle point approximation |l4L[l2 ]. 

The /i-integral within the risk r(a, /) ( |245| ) involves two /i-dependent factors 

[dhp(h\f)p(y\x,h) = [dhe-bW h M- F W) e -P»Am**)-ny\*M (30 6) 
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with posterior temperature l/f3 p and posterior energy E(h\f) = E(h\D), which 
will in the following be written simply E(h). For a Taylor expansion of the 
energy E(h) with respect to h around a minimum h* of E(h), the first order 
terms vanish and the Hessian H, i.e., the matrix of second derivatives is positive 
definite. Hence, 

E(h) = e^ h - h *\^E{h*) (307) 

1 00 1 

= E(h*) + - (Ah \H\ Ah) + ]T - (Ah\V) n E{h)\ h=h , , 

n=J, 

with H the positive definite Hessian of E(h) at h* 

d 2 E{h) 



H(h) 



dh(x, y)dh(x',y') 



(308) 
h=h* 



and V acting on E(h) but not on Ah = h — h*. 

Now assume p(y\x, h) is slowly varying at the stationary point (i.e., it has a 
/^-independent Taylor expansion at the stationary point at h*) and approximate 
it by its value p(y\x, h*) at h* . Then the second order term results in a gaussian 
with mean h* and the Hessian of E(h) at h* as inverse covariance matrix H. 
Diagonalizing the positive definite H by an orthogonal transformation O 

H = T DO, (309) 

changing the integration variables from (g-dimensional) h to g according to 

Ah = y/fciVDOyig (310) 

with Jacobian 

j = det ( ^rl) = ^ det (311) 
\dg(y)J 

the (/-dimensional gaussian integral can be performed analytically 

Jdhe-^( Ah \ H \ Ah ) = Jiti = (det IT)"* [jf ■ (312) 

Thus in case we restrict to the contribution of only the lowest minimum of E* , 
i.e., assuming other local minima of E to be sufficiently larger than E* for a given 
f3 p , we have for (5 P large enough (and no other stationary points contribute) 

/ - 

Jdhp(h\f)p(y\x,h) w (detff)-s (J^y efr F W e -fh> E i h ') p (y\ x ,h*). (313) 

The factors beside p(y\x, h*) are h, x, y and a-independent and can therefore 
be skipped from the risk in case only one minimum of E(h) contributes to the h- 
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integral. Moreover, evaluating e ^p f ^ = Z(7i) = fdhe Pp E ( h \f) also by saddle 
point approximation one finds for (3 P large enough 

e f3 p F(H) = J dhe -P p E(h\f) „ (detH) l (J_y* e f3 p E(h*) (314) 



Z(H) 



so the factors in Eq. (|313[) cancel yielding 



dhp(h\f)p(y\x,h) = p (y\ x ,h*) + 0(^). (315) 

fi P 



(For the justification of the symbol 0(4-) see Section B.5.3j .) 

A maximum posterior approximation of the h integral in r(a, f) minimizes 
instead of the expected risk r(a, /) of fl245| ) the risk 



r(a,h*) = J dx J dyp(x)p(y\x,h*)l(x,y,a) (316) 

for 

h* = axgmax heH p(h\f) = axgmm heH E(h\f), (317) 

which assumes slowly varying p(y\x,h) at h*. Hence, integration is replaced by 
minimization of E(h\f) (maximization of p(h\f)). 

For multiple minima of E(h) which are well enough separated the volume 
factor | det H\~i can be used as weighting factor for the contributions of different 
minima. Well enough separated, means that the gaussian approximations around 
the different maxima do not considerably overlap. Note however, that in this case 
the (nontrivial) calculation of the Hessian is required. If this is not possible, the 
ratio det H\/ det H2 = det HiH^ 1 of the determinant of two Hessians H\, H2 can 
be approximated by expanding the logarithm according to 

00 (_-i\n-l 

ln(l + x) = Y ± '- x n (318) 



n 

n=l 



around the identity matrix I in A = H\H 2 1 — I 

debH 1 H; 1 = e' Ith * I+ *). (319) 

A graphical algorithm for expansion of a determinant is given by the polymer 
expansion p6[ |. 

For the case of minima with overlapping contributions sec ||[l^.[nj . 

B.5.2. Complete maximum likelihood approximation 

In this Section the saddle point approximation will be discussed in the 
limit of small (conditional) likelihood temperatures f/f3 L instead of posterior 
temperatures 1//3 P . It should be stressed, that in contrast to a (conditional) 
maximum likelihood approximation which maximizes the likelihood p(y T \x T ,h) 
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= JJiP(yi\xi, h) for training data we consider here the complete (conditional) 
likelihood p(y D \x D , h) = p(y T U y P \x D , h) containing training and prior data. The 
two approximations are equivalent only for uniform p(h\Do). 

To be a bit more general, we also include continuous hidden variables 9 and 
discrete hidden variables i. In the following the integral over 9 will be treated 
in saddle point approximation analogous to the /i-integral while the (finite) sum 
over i will be treated exactly. Then, one finds for complete data, i.e., uniform, 
possibly improper, p(h) for the predictive density 



p(y\x, D)=J dhp(h\D) p(y\x, h) (320) 
_ Jdhp(y D \x D ,h)p(y\x,h) 



Jdhp(y D \x D ,h) 

Jdh Jd9 T,iP{y D ^^\ x D^ h )p(y\ x ^ h ) 



(321) 
(322) 



Jdh Jd9 Y,iP{y D ^ 9 ^\ x D^ h ) 

Y L $ dh J d9 YjP(yD^ 6 ^\ x D^ h )p(y\ x ^ h ) ( 323 ) 

introducing Zi = Jdh Jd9 ^2iP{y D -,9,i\x D ,h). Decomposing 9 = (9 x ,9 y ) and i 



= (ix,iy) in an input and an output noise component (see Section B.2.4 ), this 
yields 

p(y\x,D) = -^-Jdh Jd9 ^p(9 x ,i x \x D )p{y D \9 y ,i y ) 

xp(9 y , i y \x D ,h, 9 x ,i x )p(y\x, h), (324) 

The likelihood may be written in terms of (the complete, 9 X — , ^-conditional) like- 
lihood energy E(9 y ,i y \9 x ,i x , h) and corresponding likelihood temperature 1//3 L 

p(9 y ,i y \x D ,h,9 x ,i x ) = e -P L (E(9y,i y \e x , ix ,h)-F(e y ,i y \e x ,i x ,h))^ ^ 

with F(Q y ,I y \9 x ,i x ,h) the free energy corresponding to normalization over 9 y 
and i y . Assuming small enough likelihood temperature 1/(3 L to calculate the h 
and 9 integrals in saddle point approximation the contributions from numerator 
and denominator Zl cancel yielding, like Eq.(|315|) 



dhp(h\D)p(y\x, h) w p(y\x, h*). (326) 

If the normalization Z(@ y ,I y \9 x ,i x , h) over 9 y and i y is i x -, 9 X -, /i-independent 
then h* = h*(9*) with 

/»* = argmax fc ^p(^,i a! | ab )p(^|^,i w ) e -^ B ^' < »l^ h ), (327) 

i 

r = argmax^p(^,^|x D K^|^,^) e -^ s ^l^^^), (328) 
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which are solutions of the coupled stationary equations 

d_ 
dh 

0_ 

09 



= ^E^'^I^M^I^'%) e " /3im * iyl ^' ia; ' /l) ' ( 329 ) 

i 

= iEK«^k)p(!bl»^K 4%<,|WW ' (330) 



If 

ix~ > &x~ j or h— dependent, normalization terms Z(@y,Iy\9 x ,i x ,K) also appear 
in the stationarity equations. 

In the case of n training data with likelihood energy being the sum 

n 1 n 

Y, Ej(Vj\xj,h) = n ~J2 E AVi\ x h h )= n ( E j) tiainine (331) 
j j 

one can choose 

4 = n, (332) 

provided additional prior information ensures that the expectation (l/n)^-^? 
= n (Ej) converges for large n. In that case the saddle point approximation 

*■' training 

is a large n approximation. 

B.5.3. Loop expansion 

One can go beyond a maximum posterior approximation and include higher 
order contributions. We will discuss in the following the expansion in posterior 
temperatures. For that purpose, we will write a full expectation over p = p(x\y, h) 
as a sum of gaussian expectations, i.e., symbolically, 

( p } fuu = a* ( p e R )^ = a */p(l + R + ^- + --)\ , (333) 

* ^ ' ' gauss 

where a* = e -$* B < h *>(7ra /(JZ) is a maximum a posteriori result. The expan- 
sion of the exponential e R corresponds to a moment expansion analogous to 
Eqs.( 183 , |190 ). This expectation can be expressed in terms of moments of h by 
expanding p and the remaining term R around h*. This will result in an expan- 
sion in powers of 1/(3 P . Expanding also Z in the denominator common prefactors 
cancel 

+ + •••)) 

(PLn= /. - D2 x— ^ (334) 



l + R + ^- + 



Hence, we begin by expanding p(y\x, h) around a stationary point h* 

OO -i 

p(y\x,h) =e^ h ^p(y\x,h)\ h=h ,. = £ - (Ah\ V) n p(y\x, h)\ h ^. (335) 



n=0 
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Changing for H = O t DO the integration variables from htog with Ah = h — h* 
= yJ]3 p ~(\fDO)~ 1 g and Jacobian J of Eq.(|311|), we find 



dhp(h\f)p(y\x,h) 
= Udh (e< Ah ^p(y\x,h*)) e-^ e{AhlV)E ^) 
| e -/W) J dh ff- ±. { Ah\Vr P (y\x,h*)\ 

\m=0 ' / 



xe 



-P p (±(Ah\H\Ah) + j:^ 3 ±{Ah\vrE(h*)) 



}_ e -p P E(h*) f dhe -^(Ah\H\Ah) ( y 1 /Ah\V) m p(y\x,h*)\ 

k=0 K - \n=3 n - ) 



JZ 



e -W) f dge -^ Y.^ L lE{(^DO)- 1 g \v) m p{y\xX 
■> \m=o m\f3 P 2 

oc . / oc i \ k 

XzZu[-E-^l(^0)- l g\v) n E(h*)) , 



fc=o \ n=3 n!/3 r 2 



1 /• ,,,000000 oo / -t \k -i 

1 „-p p E(h*) f^.-lslBL V- V- V- V- (- 1 ) 1 



-e ^ 



'(V^O)-^|v) m Ky|^ h*j) f[ (((^0)^^)"' , (336) 

i 

with normalization factor 

JZ = e- /3 p F(w) /3 P f (det^)i (337) 

The normalization integral Z can be treated analogously in saddle point approx- 
imation leading in first order to the cancellation of the prefactor. The individual 
terms are moments of a multidimensional gaussian and can be evaluated using 
Wick's theorem ( [1801) . Because the gaussian has mean zero odd moments vanish 
and Eq.( |336| ) is an expansion in 1//3 P , also known as loop expansion. Only if not 
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expanding around a saddle point linear terms would survive. Higher order terms 
are usually represented as Feynman diagrams [ 29 J^, f^]5^]4"9||75| , 3C . 40j, 10j, 46 ] . 
Surprisingly, the presence of the denominator e~^p^^ = Z(TC) leads to a simpli- 
fication. It can be shown that it cancels exactly the so called vacuum diagrams. 
Further simplifications arise for expanding the cumulant generating functional 
W = In Z where only connected diagrams contribute or its Legendre transform T 
where only amputated and one-particle irreducible diagrams have to be consid- 
ered, r is the generating function of so called (proper) vertex functions. Equa- 
tions connecting moments, cumulants, and vertex functions of different order 
can be obtained (e.g., Ward-identities, equations of motion or Dyson-Schwinger 



equations). Solving Eq.(223), for example, corresponds to summing up infinite 
subclasses of diagrams. The name loop expansion stems from the fact that ex- 
pansion of Z (which does not contain p(y\x, h)) or V in powers of (3~ l is equivalent 
to an expansion in the number of loops of the corresponding Feynman diagrams. 

B.5.4- Stationarity equations 

A specific state h is given by a parameter vector £ determining p(y\x,h), 
\/x £ X^/y G y. Thus, the stationarity equation to be solved for an a-posteriori 
approximation is of the form 

^=0, Vm, (338) 

where E(h) is an exponent to be minimized. 

In the present paper we use equally weighted quadratic error terms for train- 
ing data assuming gaussian p(y\x, h) according (1252 ) with equal variance (and 
therefore equal normalization) for all x and mean specified by a regression func- 
tion h(x). Then the stationarity equation is obtained by setting the functional 
derivatives 

Hg-0, V«*«. (339) 

In the more general case, like in density estimation where one wants to allow 
for non-gaussian densities p(y\h) (then also the data terms \n.p{yi\h) in the error 
functional become non-quadratic), arbitrary p(y\x,h) can be used, as long as 
they fulfil the normalization conditions 

fl f , I =0, VxeX R ,yey with fdy P (y\x,h) = l,VxeX R ,he H. 
op(y\x,n) J 

(340) 

In that case the posterior probability can be maximized with respect to p(y\x, h) 
using standard techniques of constraint optimization Jl9"|,|3]j7| . The normalization 
conditions can be implemented by <5-functions 

p(h\f) OC e-P&i E(ViM+E(h\Do)) S ( I dy e -pE(y\x,h)+c(x) _ ^ (341) 
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where p(h\D ) oc e -^(^l^o) ; p ( x \y,h) oc e ^ E ^ x ' h \ c(x) = -In Z(y\x,h) = 
(1/ P)F(y\x, h), /i-independent factors have been skipped, and we have written 
(3 L = p. Using the Fourier representation of the 5-function 

1 POO 1 MOO 

S( X ) = J- dke ikx = ^- dke~ kx , (342) 
2vr J-oo 2-ki J-ioo 

and performing the A integral by a saddle point approximation (which is exact 
for the delta function) yields 

p{h\f) OC e -/3(E l S (^l-^)+ S (' l l B o))+/^A*(a ; )(l-/d y e-^^l- h )+^)) > (343) 

Here the Lagrange parameter function A*(x) denotes the stationary value of 
A(x), i.e., a solution of the saddle point condition dP(h\f)/dA(x) = 0. It is easy 
to see that this stationarity condition which A has to fulfil at the end of the 
optimization procedure is equivalent to the normalization constraint. There exist 
many standard iteration procedures to perform the maximization of ( |343 ). 



Alternatively, the normalization constraint can be implemented by writing 

and solve for a stationary point g(x,y,h*) 
acv g{x,y,h) \ 

- 0, Vx G Xji, y £ y. (345) 



dg(x,y,h) 

This is equivalent to the insertion of the 5-functions in Eq. (|341] ) and using always 
the stationary value A*(x) during iteration. 

B.6. Approximation problems 

In the most common kinds of problems the hypothesis space is identified 
with the model space A = 7i. Let an approximation problem be defined as a 
situation with A = TC and log-loss 

b(x)l(x, y,a = h) + c(x, y) = — lnp(y\x, h) = [5E{y\x, h) (346) 

with a-independent coefficients c(x, y) and b(x) > 0. Notice that the probability 
p(y\x,h) is that governing the production of test data and is in principle not 
related to the posterior p(h\D) or likelihood probabilities p(y D \x D , h). Often, 
however, training data are also assumed to be produced according to the same 
p(y\x, h). For log-loss it is easy to see, by using Jensen's inequality, that 

a* = argmin ag ^r(a, h*) = h* , (347) 

and thus 



p(y\x,a*) =p(y\x,h*). 



(348) 
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P(h If) 



argmax p( h | f } 



P(h |f) 



argmax p( h | f ) 




Figure 17. 



Figure 18. 



The figures compare the maximum posterior approximation (top row) with the mode of the 
posterior (second row) and a full Bayesian risk minimization (third row) for the model defined 
by Eq.(|5l|j35|,|5|). (Fig.[t7| for d = 0.1 and Fig.[jJ| for d = 0.5.) Row 1. Left: p(h\f) 
(unnormalized) for —1.3 < h < 1.3 and < P < 10. Right: argmaxhgHP(^|/) f° r < /3 < 10. 
The second local maximum which appears for larger /3 (low temperature) is also shown. Row 
2. Left: p(y\f) (unnormalized) for —3.3 < y < 3.3 and < (3 < 10. Right: argmaxj /6 yp(j/|/) 
for < P < 10. Row 3: Left: r(a,f) for -4 < a < 4 and < f3 < 10. Right: argmin ae ^r(a, /) 
for < P < 10. Row 3: Left: The contributions of —3 < y < 3 to the full risk for a = and 
< P < 10. Right: All three approximations within the same diagram. 

Thus, for approximation problems the maximum posterior approximation h* gives 
already a consistent solution a = h* . In this case the energy function of the 
posterior probability p(h\f) to be minimized is also called error function, and the 
maximum a posteriori approximation is equivalent to empirical risk minimization 
[|^,^J6^]. In the typical case of gaussian states p(y\x, h) the log-loss is up to a 
constant proportional to the squared error (y — a(x)) 2 . 

A similar result holds for the full Bayesian approach. Choosing actions a 
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from the space A = T of states of knowledge / instead from the model space Tt 
of states of Nature h (J 7 is the convex hull of H), and using log-loss ( |346| ) gives 

a* = argmin a6 _4r(a, /) = /, (349) 

and thus 

p(y\x,a*) =p(y\x,f). (350) 

Other choices are the squared error (y — a(x)) 2 for which the mean of y over the 
predictive distribution p(y\x,f) is optimal, absolute error loss \y — a(x)\ which 
leads to the median, and zero-one loss 1 — 8y )a (x) resulting in the mode to be 
optimal. For gaussian distributions mean, median, and mode coincide. 

For non-approximation problems calculating the maximum posterior solu- 
tion h* (or the state of knowledge /, respectively) has to be followed by the 
second step of finding the optimal a* which minimizes r(a,h*) (or r(a, /)) for 
given h* (or /) |4^| . For squared error but non-gaussian p(y\x,h*), for exam- 
ple, the second minimization step would require the calculation of the mean of y 
under p(y\x, h*). 

There are typical non-approximation situations. Complexity costs, for ex- 
ample, which penalize a complex a (e.g. due to computational resources or to 
facilitate comprehensibility) are typical for non-approximation problems. Only 
when there are reasons to believe that also the real state of Nature h fulfils such 
complexity constraints this results in an approximation problem. This means, one 
has to distinguish between complexity costs and knowledge about true simplicity 
in nature. 

Related is the distinction between a generative and a reconstruction model. 
Consider the face detection task of Example |2| in Section 2.1. Hereby, a recon- 



struction model p(y\x, h) would specify a probability of face vs. non-face given 
an image, while a generative model p(x\y, h) specifies the probability of an image 
provided it represents a face or non-face. The roles of x and y as independent and 
dependent variables are herein exchanged. Looking for an optimal y = a{x) can 
be formulated as approximation problem for the reconstruction model. However, 
the probability of representing a face does depend on all non-faces which can 
appear. Hence changing the set of non-faces the reconstruction model has to be 
adapted. The specification of the inverse generative model is often easier. Here 
the generative model for faces p(x\y, h) does not have to be changed if the class 
of non-faces changes. Modeling physical processes of the mapping from objects 
to images also falls in this class. But a generative model does not define an ap- 
proximation problem for the function a which returns an answer y = a{x) for a 
given x and not vice versa. Then a maximum posteriori approximation requires 
a second minimization. 

Finally, Figs.^jl^ show the relations between the optimal a for a full 
Bayesian risk r(a, /) and a maximum posterior approximation (or empirical risk 
minimization) for a simple situation with only one x -value, i.e., X = {x}, so x 
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and be skipped from the notation. Assumed are gaussian model states 

parameterized by their mean h £ TC = IR 



2ir 



p(y\h) 

real numbers as possible actions a £ A = TC = JR with log-loss 
Kv,a) = {y~ af -- 



|lnp(y|ft = a) + iln|. 



(351) 



(352) 



Data y D of the form d AND (b OR c) with b = 1, c = —1 have been represented 
by a gaussian mixture with equally weighted components 



p(y D \f) 



P_ p ~MI,-,l^ 2 ( f -^l,-\ r 

2ir 



, - (e~w- x > +e"2 



This means for complete data, i.e., uniform prior p(h) 



P(h\f) 



p(y D \ h ) 



p(y D \ h ) 



Jdhp(y D \h) 
The expected risk reads 



ly^e-f^-iP+e-^+D 2 ) 



r(a,f) = / dyp(y\f)l(y,a) 



(353) 



(354) 



(355) 



with predictive distribution 



P{y\f) = / dhp(y\h)p(h\f). 



(356) 

Shown are the /3-dependency of 1. the maximum posterior solutions h* = 
aigmaxh^-j-ip(h\f) which for the given approximation situation corresponds to an 
empirical risk minimization a* = argmin ag _4£'(a) for error E(a) = — lnp(a\f), 2. 
the mode of the predictive distribution argmax^gyp^ | /) , and 3. the full Bayesian 
solution a* Bayes = argmin ag- 4r(a, /). In Fig]!?] the 'data' term d = 0.1 is nearly 
in the middle between the two alternative templates 6 = 1 and c = — 1, while in 
Fig. 18 the 'data' term d = 0.5 is much nearer to b = 1. Compared to Fig.[l7] the 
second local maximum of the posterior probability p(h\f) appears at larger (3 (or 
lower temperature) in Fig. 18, where also the maximum posterior approximation 
is better. Indeed it is well known from physics that mean field (maximum pos- 
terior) approximations break down near phase transitions. On the other hand, 
using an adapted (3 1 unequal to the true (3 an improved solution can be obtained 
by a maximum posterior approximation. 
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